Low-dimensional representation of Gaussian mixture model supervector for language recognition

In this article, we propose a new feature for SVM-based language recognition by bringing the idea of total variability from speaker recognition to language recognition. We regard the new feature as a low-dimensional representation of the Gaussian mixture model supervector. Building on the total variability (TV) language recognition system, we further propose a multiple total variability (MTV) language recognition system. Our experiments show that the total factor vector carries language-dependent information, and that the multiple total factor vector carries more of it. Experimental results on the 2007 National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) database show that MTV outperforms TV in the 30 s task, and that both the TV and MTV systems achieve performance similar to that of state-of-the-art approaches. The best performance of our acoustic language recognition systems can be further improved by combining them with these two new systems.


Introduction
The aim of language recognition is to determine the language spoken in a given segment of speech. It is generally believed that phonotactic features and spectral features provide complementary cues [1,2]. The phone recognizer followed by language models (PRLM) and parallel PRLM (PPRLM) approaches, which use phonotactic information, have shown very successful performance [2,3]. The acoustic method, which uses spectral features, has the advantage that it does not require specialized language knowledge and is computationally simple. This article focuses on the acoustic component of language recognition systems. The spectral features of speech are collected as independent vectors. These vectors can be extracted as shifted-delta-cepstral acoustic features and then modeled by a Gaussian mixture model (GMM); results for this approach were reported in [4]. The approach was further improved by discriminative training with the maximum mutual information (MMI) criterion. Several studies use support vector machines (SVM) in language recognition to form the GMM-SVM system [5,6]. In language recognition evaluations, MMI and GMM-SVM are the primary acoustic systems.
Recently, the total variability approach has been proposed in speaker recognition [7,8]; it uses factor analysis to define a new low-dimensional space called the total variability space. In contrast to classical joint factor analysis (JFA), the speaker and channel variability are contained simultaneously in this new space. Intersession compensation can then be carried out in the low-dimensional space.
In fact, the total variability approach can be viewed as a classical application of probabilistic principal component analysis (PPCA) [9]. Its factor analysis obtains useful information by reducing the dimension of the space of GMM supervectors; that is, all utterances can in fact be well represented in a low-dimensional space. We believe useful language information can be obtained by a similar front-end process, and we therefore introduce the idea of total variability to language recognition. We estimate the language total variability space using the dataset described in Section 5, treating each utterance of a given target language as if it belonged to a different language. The total factor vector is then extracted by projecting an utterance onto the language total variability space. As in speaker recognition, intersession compensation can also be performed well on the low-dimensional total factor vector. In our experiments, two intersession compensation techniques, linear discriminant analysis (LDA) [6] and locality preserving projection (LPP) [10-12], are used to improve the performance of language recognition.
In some previous studies [13,14], rich information is obtained by using multiple reference models, such as male and female gender-dependent models in speaker recognition. In language recognition, there are generally abundant data for each target language, and the number of target languages is limited. Building on the TV language recognition system [12,15], we propose the MTV language recognition system, in which language-dependent GMMs are used instead of the universal background model (UBM) for language total variability space estimation and total factor vector extraction. Our experiments show that the total factor vector (TV system) carries language-dependent information, and that the multiple total factor vector (MTV system) carries more of it.
This article is organized as follows. Section 2 gives a brief review of total variability, support vector machines, and compensation of channel factors. Section 3 applies total variability to language recognition. Section 4 presents the proposed language recognition system in detail. Section 5 describes the corpora and evaluation, and Section 6 gives the experimental results. Finally, we conclude in Section 7.

Total variability in speaker recognition
In speaker recognition, unlike classical joint factor analysis (JFA), the total variability approach defines a new low-dimensional space, called the total variability space, that contains the speaker and channel variability simultaneously. The total variability approach thus relaxes the independence assumption between the speaker and channel variability spaces made in JFA speaker recognition [16].
For a given utterance, the speaker- and channel-dependent GMM supervector M is written as in Equation (1):
$$M = m_{ubm} + Tw \tag{1}$$
where $m_{ubm}$ is the UBM supervector, $T$ is the total variability space, and the components of the vector $w$ are the total factors. We believe useful language information can be obtained by a similar front-end process. We therefore apply total variability to language recognition.

Support vector machines
An SVM [17] is used as the classifier after our proposed front-end process in the language recognition system. An SVM is a two-class classifier constructed from sums of a kernel function $K(\cdot,\cdot)$:
$$f(x) = \sum_{i=1}^{N} \alpha_i t_i K(x, x_i) + d \tag{2}$$
where $N$ is the number of support vectors, $t_i$ is the ideal output, $\alpha_i$ is the weight for the support vector $x_i$, $d$ is a constant offset, $\alpha_i > 0$, and $\sum_{i=1}^{N} \alpha_i t_i = 0$. The ideal outputs are either 1 or -1, depending upon whether the corresponding support vector belongs to class 0 or class 1. For classification, a class decision is based upon whether the value $f(x)$ is above or below a threshold.
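The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the SVMTorch implementation: the function names, the toy support vectors, and the constant offset `d` (the usual SVM bias term) are assumptions made for the example.

```python
import numpy as np

def linear_kernel(a, b):
    """Linear inner-product kernel K(a, b) = a . b."""
    return float(np.dot(a, b))

def svm_score(x, support_vectors, alphas, targets, d=0.0, kernel=linear_kernel):
    """Evaluate f(x) = sum_i alpha_i * t_i * K(x, x_i) + d."""
    return sum(a * t * kernel(x, sv)
               for a, t, sv in zip(alphas, targets, support_vectors)) + d

# Toy model (hypothetical values): two support vectors, one per class.
support_vectors = np.array([[1.0, 0.0], [-1.0, 0.0]])
alphas = np.array([0.5, 0.5])    # weights, all > 0
targets = np.array([1.0, -1.0])  # ideal outputs; sum(alpha_i * t_i) == 0

x = np.array([2.0, 1.0])
score = svm_score(x, support_vectors, alphas, targets)
decision = 1 if score > 0.0 else -1  # threshold at 0 is an assumption
```

With a linear kernel the sum collapses to a single inner product with a weight vector, which is why scoring stays cheap for low-dimensional inputs such as total factor vectors.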

Compensation of channel factors
Compensating for the variability caused by changes in speaker, channel, gender, and environment is key to the performance of automatic language recognition systems. Our proposed front-end still applies an intersession compensation technique in the spectral feature domain, which was proposed for speaker and language recognition in [18,19]. The adapted feature vector $\hat{o}^{(i)}(t)$ is obtained by subtracting from the original observation feature $o^{(i)}(t)$ a weighted sum of the intersession compensation offset values:
$$\hat{o}^{(i)}(t) = o^{(i)}(t) - \sum_{m} \gamma_m(t) U_m y^{(i)} \tag{3}$$
where $\gamma_m(t)$ is the posterior probability of Gaussian mixture component $m$ of the universal background model (UBM) for a given frame of an utterance. $U_m$ and $y^{(i)}$ describe the intersession compensation related to the $m$th Gaussian of the UBM: $U_m$ is the intersession subspace and $y^{(i)}$ is the channel factor vector. In our proposed language recognition system, we use the spectral feature after compensation of channel factors.
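As a rough illustration of this frame-level compensation, the sketch below subtracts the posterior-weighted sum of per-Gaussian offsets from one frame. The array shapes, the function name, and the random toy values are assumptions for the example; estimating the subspace and the channel factors is outside its scope.

```python
import numpy as np

def compensate_frame(o_t, gammas_t, U_blocks, y):
    """Remove the posterior-weighted intersession offset from one frame.

    o_t      : (F,)      original feature vector of the frame
    gammas_t : (M,)      posterior of each UBM Gaussian for this frame
    U_blocks : (M, F, R) intersession subspace block of each Gaussian
    y        : (R,)      channel factor vector of the utterance
    """
    offsets = U_blocks @ y            # (M, F): offset U_m y for every Gaussian
    return o_t - gammas_t @ offsets   # subtract sum_m gamma_m(t) * U_m y

# Toy dimensions (hypothetical): 2 Gaussians, 3-dim features, rank-2 subspace.
rng = np.random.default_rng(0)
U_blocks = rng.normal(size=(2, 3, 2))
y = rng.normal(size=2)
o_t = rng.normal(size=3)
gammas_t = np.array([1.0, 0.0])  # frame fully assigned to Gaussian 0

o_hat = compensate_frame(o_t, gammas_t, U_blocks, y)
```

When one Gaussian dominates the posterior, as in this toy frame, the correction reduces to the offset of that Gaussian alone.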

Applying total variability in language recognition
Total variability space estimation differs from eigenvoice space estimation in speaker recognition [8,20] in only one respect. In eigenvoice estimation, all recordings of a speaker are considered to belong to the same person. In total variability space estimation, however, each utterance of a given speaker is regarded as having been produced by a different speaker. If we likewise regard each utterance of a given target language as having been produced by a different language, a common pool of hidden variables acts as basis factors and represents the utterances from different languages. The estimation of the language total variability space then follows exactly the same procedure as the estimation of the total variability space and the eigenvoice space in speaker recognition, which is an iterative algorithm [21]. Because the use of the data is the only difference, it is critical. We therefore suggest that all utterances of each target language be used to estimate the language total variability space.

Language total variability space estimation
For a given utterance, the language- and channel-dependent GMM supervector can also be written as in Equation (1), because language total variability space estimation follows exactly the same procedure as total variability space estimation and eigenvoice space estimation in speaker recognition. We can regard the total factor vector model as a new feature extractor that projects an utterance onto a low-rank space T to obtain a language- and channel-dependent total factor vector w. The space can be estimated by an iterative algorithm [21].

Language-dependent total variability space estimation
In language total variability space estimation, the total variability space is estimated relative to the UBM, which is independent of language, speaker, channel, gender, and environment. Some previous studies [13,14] show that rich information can be obtained by using multiple reference models. These studies suggest the possibility of using language-dependent GMMs instead of the language-independent UBM in language total variability space estimation. When the total variability space is estimated relative to a language-dependent GMM, we call it a language-dependent total variability space.
First, we train a GMM for each of the L target languages using maximum likelihood (ML) [22]. Then L language-dependent total variability spaces are estimated by using those language-dependent GMMs instead of the language-independent UBM. An utterance is projected onto the L different spaces to obtain L total factor vectors; as an example, the model for the Mandarin GMM is given by Equation (4):
$$M = m_{mandarin} + T_{mandarin} w_{mandarin} \tag{4}$$
where $m_{mandarin}$ is the supervector of the Mandarin GMM and $T_{mandarin}$ is the corresponding language-dependent total variability space. We combine the L total factor vectors into one large multiple total factor vector, as in Equation (5):
$$w_{MTV} = [w_1, w_2, \ldots, w_{mandarin}, \ldots, w_L] \tag{5}$$
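The combination step can be pictured as a plain concatenation of the per-language vectors. The sketch below assumes the L total factor vectors have already been extracted; the language names, the 2-dimensional toy vectors, and the function name are hypothetical.

```python
import numpy as np

def multiple_total_factor_vector(per_language_vectors):
    """Stack the L per-language total factor vectors into one long vector."""
    return np.concatenate([np.asarray(w, dtype=float)
                           for w in per_language_vectors])

# Hypothetical 2-dimensional total factor vectors of one utterance, one per
# language-dependent total variability space.
w_by_language = {
    "mandarin": np.array([0.3, -1.2]),
    "english": np.array([1.1, 0.4]),
    "spanish": np.array([-0.7, 0.9]),
}

w_mtv = multiple_total_factor_vector(w_by_language.values())
```

The dimension of the result grows linearly with the number of target languages, so the ordering of the languages must be kept fixed across all utterances.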

Intersession compensation
After the new feature extractor, the intersession compensation can be carried out in low-dimensional space.
In our experiment, we use the linear discriminant analysis (LDA) approach and locality preserving projection (LPP) approach for intersession compensation.

Linear discriminant analysis
In linear discriminant analysis, all total factor vectors of the same language are treated as one class.
The LDA transformation in Equation (6) projects the total factor vector $w$ onto new axes that maximize the variance between languages and minimize the intra-class variance:
$$w' = A^{t} w \tag{6}$$
The matrix $A$ is trained on the dataset described in Section 5 and consists of the eigenvectors of Equation (7):
$$S_b \nu = \lambda S_w \nu \tag{7}$$
where $\lambda$ is the diagonal matrix of eigenvalues and $\nu$ is the eigenvector corresponding to a non-zero eigenvalue. The matrix $S_b$ is the between-class covariance matrix and $S_w$ is the within-class covariance matrix.
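A minimal sketch of training such a projection is given below. It uses the textbook scatter-matrix definitions of the between-class and within-class matrices and a small ridge term for invertibility; both choices, as well as the synthetic three-language data, are assumptions of the example rather than details taken from the system described here.

```python
import numpy as np

def lda_matrix(W, labels, n_components):
    """Return A whose columns are the leading eigenvectors of S_b v = lambda S_w v.

    W      : (n_utterances, dim) total factor vectors, one per row
    labels : (n_utterances,)     language label of each row
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    dim = W.shape[1]
    mean_all = W.mean(axis=0)
    S_b = np.zeros((dim, dim))
    S_w = np.zeros((dim, dim))
    for lang in np.unique(labels):
        W_l = W[labels == lang]
        mean_l = W_l.mean(axis=0)
        diff = (mean_l - mean_all)[:, None]
        S_b += W_l.shape[0] * (diff @ diff.T)   # between-class scatter
        centered = W_l - mean_l
        S_w += centered.T @ centered            # within-class scatter
    S_w += 1e-6 * np.eye(dim)                   # ridge term (assumption)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(-eigvals.real)[:n_components]
    return eigvecs[:, order].real

# Synthetic data (hypothetical): 3 "languages", 50 vectors each, 4 dimensions.
rng = np.random.default_rng(0)
class_means = np.array([[0, 0, 0, 0], [5, 0, 0, 0], [0, 5, 0, 0]], dtype=float)
labels = np.repeat(np.arange(3), 50)
W = class_means[labels] + rng.normal(size=(150, 4))

A = lda_matrix(W, labels, n_components=2)
W_proj = W @ A   # each row is w' = A^t w
```

With C classes the between-class matrix has rank at most C - 1, so at most C - 1 eigenvalues are non-zero and the projection keeps no more than that many axes.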

Locality preserving projection
Locality preserving projection (LPP) [10,11] differs from LDA, which effectively preserves the global structure and the linear manifold. LPP instead considers the manifold structure, which is modeled by a nearest-neighbor graph, and yields an embedding that preserves local information. In this way, the variability resulting from changes in speaker, channel, gender, and environment may be eliminated or reduced. Thus LPP can be used for intersession compensation.
With the LPP transformation matrix $A_{LPP}$ in Equation (8), the total factor vector $w$ is projected to $w'$ so as to preserve the local structure of the total factor vectors:
$$w' = A_{LPP}^{t} w \tag{8}$$
To train the LPP transformation matrix, we first construct the nearest-neighbor graph. Let G denote a graph with m nodes, where the $i$th node corresponds to the total factor vector $w_i$. We put an edge between nodes $i$ and $j$ if $i$ is among the $k$ nearest neighbors of $j$, or $j$ is among the $k$ nearest neighbors of $i$. In this article, $k$ is set to 5. If nodes $i$ and $j$ are connected, the edge weight $E_{ij}$ is set as in Equation (9). The justification for this choice of weights can be traced back to [23].
Then, we compute the eigenvectors and eigenvalues of the generalized eigenvector problem:
$$W^{t} L W a = \theta W^{t} D W a \tag{10}$$
where $D$ is a diagonal matrix whose entries are the column sums of $E$, $D_{ii} = \sum_j E_{ji}$, $L = D - E$ is the Laplacian matrix, and the $i$th row of the matrix $W$ is $w_i$. Let $a_0, a_1, \ldots, a_{\tau-1}$ be the solutions of Equation (10), ordered according to their eigenvalues, $0 \le \theta_0 \le \theta_1 \le \ldots \le \theta_{\tau-1}$. Thus, the LPP transformation matrix is as follows:
$$A_{LPP} = [a_0, a_1, \ldots, a_{\tau-1}] \tag{11}$$
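The LPP training procedure can be sketched as follows. The heat-kernel edge weight exp(-||w_i - w_j||^2 / t), its parameter `t`, the small ridge term, and the random toy data are assumptions of this example; the generalized eigenproblem is reduced to an ordinary symmetric one through a Cholesky factor so that only NumPy is needed.

```python
import numpy as np

def lpp_matrix(W, k=5, t=1.0, n_components=2):
    """Return (A_LPP, theta): eigenvectors with the smallest eigenvalues of
    W^t L W a = theta W^t D W a. Row i of W is the total factor vector w_i."""
    W = np.asarray(W, dtype=float)
    m, dim = W.shape
    sq = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)  # squared distances
    nn = np.argsort(sq, axis=1)[:, 1:k + 1]              # k nearest, skip self
    adj = np.zeros((m, m), dtype=bool)
    adj[np.repeat(np.arange(m), k), nn.ravel()] = True
    adj = adj | adj.T                                    # edge if i~j or j~i
    E = np.where(adj, np.exp(-sq / t), 0.0)              # heat kernel (assumed)
    D = np.diag(E.sum(axis=0))                           # column sums of E
    lap = D - E                                          # graph Laplacian
    lhs = W.T @ lap @ W
    rhs = W.T @ D @ W + 1e-9 * np.eye(dim)               # ridge term (assumption)
    C = np.linalg.cholesky(rhs)                          # rhs = C C^t
    C_inv = np.linalg.inv(C)
    sym = C_inv @ lhs @ C_inv.T
    theta, Y = np.linalg.eigh((sym + sym.T) / 2.0)       # ascending eigenvalues
    vecs = C_inv.T @ Y                                   # back-substitute a = C^-t y
    return vecs[:, :n_components], theta[:n_components]

# Toy data (hypothetical): 40 vectors in 4 dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(40, 4))
A_lpp, theta = lpp_matrix(W, k=5, t=1.0, n_components=2)
W_proj = W @ A_lpp   # each row is w' = A_LPP^t w
```

Because the Laplacian is positive semidefinite, the eigenvalues are non-negative, and keeping the smallest ones favors directions along which neighboring vectors stay close.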

The proposed language recognition system
The proposed TV and MTV language recognition systems consist of three main processes: spectral feature extraction, total factor vector extraction, and SVM modeling with language score calibration. Figure 1 shows the two systems. In Figure 1, W denotes an element of the total factor vector w, N is the dimension of each total factor vector w, and GMM1, GMM2, ..., GMML are the Gaussian mixture models of the target languages.

Spectral feature extraction
The spectral feature in the system consists of 7 Mel-frequency cepstral coefficients (MFCC) concatenated with the shifted-delta-cepstral (SDC) N-d-p-k feature, where N = 7, d = 1, p = 3, and k = 7, giving 56 coefficients per frame in total. This representation is selected because of excellent prior results with this choice; the improvement obtained by adding the direct coefficients, including the C0 coefficient, to this feature vector was studied in [24]. In this article, spectral feature refers to this 56-dimensional feature. Nonspeech frames are eliminated by speech activity detection, and the 56-dimensional spectral feature is extracted. Feature warping [25] and cepstral variance normalization are then applied to the extracted spectral feature so that each feature is normalized to mean 0 and variance 1.
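To make the 56-dimensional count concrete, the sketch below stacks k = 7 delta blocks of N = 7 coefficients (49 values) onto the 7 static coefficients. It follows the usual SDC definition, in which block i is c(t + i*p + d) - c(t + i*p - d); the edge padding at utterance boundaries and the random stand-in for real MFCCs are assumptions of the example.

```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=7):
    """Shifted-delta-cepstral features for cepstra of shape (T, N).

    Block i of frame t is c(t + i*p + d) - c(t + i*p - d); the k blocks are
    concatenated, giving k*N values per frame.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]
    look_ahead = d + (k - 1) * p
    # Repeat the first/last frame at the edges (boundary handling is assumed).
    padded = np.pad(cepstra, ((d, look_ahead), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        shift = i * p
        plus = padded[shift + 2 * d: shift + 2 * d + num_frames]  # c(t+shift+d)
        minus = padded[shift: shift + num_frames]                 # c(t+shift-d)
        blocks.append(plus - minus)
    return np.hstack(blocks)

# Stand-in for 100 frames of 7 MFCCs (random values, for shape checking only).
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 7))

features = np.hstack([mfcc, sdc(mfcc, d=1, p=3, k=7)])  # 7 + 7*7 = 56 per frame
```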

Total factor vector extraction
In our system, the spectral feature after compensation of channel factors is used. First, the language total variability space and the language-dependent total variability spaces are estimated. Then, we extract the total factor vector as shown in Figure 1. In our experiments, the number of mixtures of the UBM (or GMM) is 1024, and the total variability space T is a rectangular low-rank matrix of dimension (1024 × 56) × 400. The dimension of w is 400.
Figure 1 The proposed TV and MTV language recognition systems.
The total factor vector $w$ is a hidden variable, and can be obtained as follows [8]:
$$w = (I + T^{t} \Sigma^{-1} N(u) T)^{-1} T^{t} \Sigma^{-1} \tilde{F}(u) \tag{12}$$
We define $N(u)$ as a diagonal matrix whose diagonal blocks are $N_c I$. $\tilde{F}(u)$ is a supervector obtained by concatenating all first-order Baum-Welch statistics $\tilde{F}_c$ for an utterance $u$. $\Sigma$ is a diagonal covariance matrix estimated during factor analysis training [20] and $T$ is the language total variability space. $N_c$ and $\tilde{F}_c$ are defined as follows:
$$N_c = \sum_{t=1}^{L} P(c \mid y_t, \Omega) \tag{13}$$
$$\tilde{F}_c = \sum_{t=1}^{L} P(c \mid y_t, \Omega)(y_t - m_c) \tag{14}$$
where $L$ is the number of frames, $c$ is the Gaussian index over the $C$ mixture components, $P(c \mid y_t, \Omega)$ is the posterior probability of mixture component $c$ generating the vector $y_t$, and $m_c$ is the mean of UBM mixture component $c$.
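The extraction step can be sketched as below, assuming the standard posterior-mean form w = (I + T^t Σ^-1 N(u) T)^-1 T^t Σ^-1 F(u) with centered first-order statistics. The tiny dimensions and random values are placeholders (the system described here uses 1024 mixtures, 56-dimensional features, and a 400-dimensional w), and the function names are hypothetical.

```python
import numpy as np

def baum_welch_stats(frames, posteriors, ubm_means):
    """Zeroth-order and centered first-order statistics of one utterance.

    frames     : (L, F) feature vectors y_t
    posteriors : (L, C) P(c | y_t) for every frame and mixture component
    ubm_means  : (C, F) mixture means m_c
    """
    N = posteriors.sum(axis=0)                                # (C,)
    F_tilde = posteriors.T @ frames - N[:, None] * ubm_means  # (C, F)
    return N, F_tilde

def total_factor(N, F_tilde, T, sigma_diag):
    """Posterior mean of w given the statistics, T (C*F, R), and diag(Sigma)."""
    feat_dim = F_tilde.shape[1]
    rank = T.shape[1]
    n_expanded = np.repeat(N, feat_dim)     # diagonal of N(u): blocks N_c * I
    T_scaled = T / sigma_diag[:, None]      # Sigma^-1 T (Sigma is diagonal)
    precision = np.eye(rank) + T_scaled.T @ (n_expanded[:, None] * T)
    rhs = T_scaled.T @ F_tilde.reshape(-1)  # T^t Sigma^-1 F(u)
    return np.linalg.solve(precision, rhs)

# Toy setup (hypothetical): 20 frames, 3 mixtures, 2-dim features, rank 2.
rng = np.random.default_rng(0)
num_frames, num_mix, feat_dim, rank = 20, 3, 2, 2
frames = rng.normal(size=(num_frames, feat_dim))
posteriors = rng.random(size=(num_frames, num_mix))
posteriors /= posteriors.sum(axis=1, keepdims=True)
ubm_means = rng.normal(size=(num_mix, feat_dim))
T = rng.normal(size=(num_mix * feat_dim, rank))
sigma_diag = rng.random(num_mix * feat_dim) + 0.5

N, F_tilde = baum_welch_stats(frames, posteriors, ubm_means)
w = total_factor(N, F_tilde, T, sigma_diag)
```

Exploiting the diagonal structure of Σ and N(u) avoids forming the large supervector-sized matrices explicitly, and solving the small linear system replaces the explicit inverse.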
The multiple total factor vector is extracted in a similar way, using the language-dependent GMM instead of the language-independent UBM and the language-dependent total variability space instead of the language total variability space, as in Equation (4). The multiple total factor vector $w_{MTV}$ is then a combination of $w_1, w_2, \ldots, w_{mandarin}, \ldots, w_L$, as shown in Figure 1 and Equation (5). In practice, in the multiple total variability language recognition system, the total factor vectors are combined after the intersession compensation described in Section 3.3.

SVM model and language score calibration
Total factor vectors and multiple total factor vectors are used as SVM features in our proposed TV and MTV systems. Our experiments are implemented by using the SVMTorch [26] with a linear inner-product kernel function.
Calibrating confidence scores in multiple-hypothesis language recognition has been studied in [27]. We should estimate the posterior probability of each hypothesis and make a maximum a posteriori decision. In the standard SVM-SDC system [6], log-likelihood ratio (LLR) normalization is applied as a simple and useful backend process. Suppose $S = [S_1 \ldots S_L]^{t}$ is the vector of L relative log-likelihoods from the L target languages for a particular utterance. Assuming a flat prior, the new log-likelihood normalized score $S'_i$ is
$$S'_i = S_i - \log\left(\frac{1}{L-1}\sum_{j \neq i} e^{S_j}\right) \tag{15}$$
A more complex full backend process is given in [6,28], where LDA and diagonal covariance Gaussians are used to calculate the log-likelihoods for each target language and improve detection performance. This process transforms the language scores with LDA, models the transformed scores with diagonal covariance Gaussians (one for each language), and then applies the transform in Equation (15).
In this article, the backend process with LDA and diagonal covariance Gaussians is used in the language recognition system, because it is superior to log-likelihood ratio normalization in our experiments.
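As an illustration of the simple backend, the sketch below assumes the normalization takes the common flat-prior form S'_i = S_i - log((1/(L-1)) * sum over j != i of exp(S_j)). The function name and the toy scores are hypothetical, and the log-sum-exp shift is only there for numerical stability.

```python
import math

def llr_normalize(scores):
    """Flat-prior log-likelihood ratio normalization of L language scores."""
    num_languages = len(scores)
    normalized = []
    for i, s_i in enumerate(scores):
        others = [s for j, s in enumerate(scores) if j != i]
        shift = max(others)  # log-sum-exp trick against overflow
        log_mean = shift + math.log(
            sum(math.exp(s - shift) for s in others) / (num_languages - 1))
        normalized.append(s_i - log_mean)
    return normalized

# Toy scores (hypothetical) for one utterance and three target languages.
llr = llr_normalize([2.0, 0.0, 0.0])
```

Each normalized score compares one language against the average likelihood of its competitors, so a positive value favors that language.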

Corpora and evaluation
The experiments are performed using the NIST LRE 2007 evaluation database. There are 14 target languages in the corpora used in this article: Arabic, Bengali, Chinese, English, Farsi, German, Hindustani, Japanese, Korean, Russian, Spanish, Tamil, Thai, and Vietnamese. The task of this evaluation was to detect the presence of a hypothesized target language for each test utterance. The training data were primarily from Callfriend corpora, Callhome corpora, Mixer corpora, OHSU corpora, OGI corpora, and LRE07Train. The development data consists of LRE03, LRE05, and LRE07Train. We use equal error rate (EER) and the minimum decision cost value (minDCF) as metrics for evaluation.
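For readers unfamiliar with the EER metric, the sketch below estimates it by sweeping a decision threshold over the observed scores and taking the point where the miss rate and the false-alarm rate are closest. This is a simplified approximation with hypothetical toy scores, not the official NIST scoring tool, and minDCF is omitted because its cost parameters are not given here.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: error rate at the threshold where miss ~= false alarm."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for threshold in np.sort(np.concatenate([target, nontarget])):
        p_miss = np.mean(target < threshold)      # targets rejected
        p_fa = np.mean(nontarget >= threshold)    # non-targets accepted
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2.0
    return float(eer)

# Toy trials (hypothetical scores).
eer_separable = equal_error_rate([2.0, 3.0], [0.0, 1.0])
eer_overlapping = equal_error_rate([1.0, 3.0], [0.0, 2.0])
```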

Experiments
We first evaluate the total variability language recognition system (TV) and then extend it to the multiple total variability language recognition system (MTV). Table 1 shows the results, in terms of EER and minDCF, of the MMI system, the GMM-SVM system, and the TV and MTV systems with the LDA and LPP intersession compensation techniques. The comparison shows that both intersession compensation techniques, LDA and LPP, are effective for the TV and MTV systems. Performance improves clearly when LDA and LPP are used simultaneously, that is, when models with LDA and models with LPP are both used to score all test utterances. We therefore regard the TV and MTV systems with simultaneous LDA and LPP as our final proposed TV and MTV systems. The proposed TV and MTV systems achieve performance similar to that of state-of-the-art approaches, which demonstrates that they are feasible. Next, we compare the TV system with the MTV system under the same intersession compensation technique. The MTV-based system performs better than the TV-based one, which indicates that the multiple total factor vector contains more language-dependent information. Among our language recognition systems for the NIST 2007 LRE 30 s task, the MTV system performs best. Table 2 shows the results, in terms of EER and minDCF, of the combination of the MMI system, the GMM-SVM system, the TV system, and the MTV system. System fusion can exploit partial error decorrelation among the individual systems, allowing performance gains over the separate systems. In language recognition evaluations, MMI and GMM-SVM are the primary acoustic systems, and the combination of the MMI and GMM-SVM systems is generally taken as the reference performance of the acoustic system. Table 1 shows that our proposed TV and MTV systems are effective.
We believe that the TV and MTV systems capture language information different from that of state-of-the-art systems, because the total factor vector and the multiple total factor vector are new features for language recognition. We therefore expect the TV and MTV systems to benefit the combined system. Combining the TV system with the MMI and GMM-SVM systems yields a relative improvement of 8.1% in EER and 16.5% in minDCF. Furthermore, adding the MTV system to the combination of the MMI, GMM-SVM, and TV systems yields a relative improvement of 12.3% in EER and 11.4% in minDCF. In all, the two systems yield a relative improvement of 19.4% in EER and 26.0% in minDCF over the combination of the MMI and GMM-SVM systems. Figure 2 shows the DET curves of the MMI system, the GMM-SVM system, the TV system, and the MTV system, as well as those of the system combinations. The proposed approaches give an observable relative improvement in language recognition performance.
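The two stepwise relative improvements compound multiplicatively, which is how the overall figures follow from them. The short check below uses only the percentages quoted in the text; the helper name is hypothetical.

```python
def compound_relative_improvement(first, second):
    """Overall relative reduction after two successive relative reductions."""
    return 1.0 - (1.0 - first) * (1.0 - second)

# Adding TV, then MTV, to the MMI + GMM-SVM combination (figures from the text).
eer_total = compound_relative_improvement(0.081, 0.123)
min_dcf_total = compound_relative_improvement(0.165, 0.114)

eer_percent = round(eer_total * 100, 1)          # about 19.4
min_dcf_percent = round(min_dcf_total * 100, 1)  # about 26.0
```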

Conclusions
In this article, the multiple total factor vector is proposed for language recognition, building on the use of the total factor vector in language recognition. Our experiments show that the total factor vector carries language-dependent information and that the multiple total factor vector carries more of it. Compared with the popular acoustic systems in language recognition (the MMI and GMM-SVM systems), these two new features contain different language-dependent information. It is attractive that our proposed features can improve our best acoustic performance, that of the combination of the MMI and GMM-SVM systems. In future work, different intersession compensation approaches will be applied to the new features.
Table 2 Results of the combination of MMI system and GMM-SVM system, and the combination of the MMI system, GMM-SVM system, TV system, and MTV system on the NIST LRE07 30 s corpus