Using Mel-Frequency Cepstral Coefficients in Missing Data Technique

Filter bank features are the most common choice in research on marginalisation approaches for robust speech recognition, because unreliable data are simple to detect in the frequency domain. In this paper, we propose a hybrid approach based on the marginalisation and soft decision techniques that uses Mel-frequency cepstral coefficients (MFCCs) instead of filter bank coefficients. A new technique for estimating the reliability of each cepstral component is also presented. Experimental results show the effectiveness of the proposed approaches.


INTRODUCTION
In spite of many years of effort, robustness in noisy environments remains a fundamental unsolved issue in today's automatic speech recognition (ASR) systems. Recently, missing data theory [1,2,3,4] has been proposed as a way to improve the robustness of the ASR decoding process. Experimental results show that it can significantly restore ASR performance while making few prior assumptions about the characteristics of the environmental noise. However, most of the previous marginalisation approaches have been derived and tested only for filter bank features, because unreliable data are convenient to detect in the frequency domain. Yet cepstral features are the parameterisation of choice for many speech recognition applications. For example, the Mel-frequency cepstral coefficient (MFCC) [5] representation is probably the most commonly used representation in speech recognition and has recently been standardized for distributed speech recognition (DSR) [6]. Generally, cepstral features are more compact, more discriminative, and, most importantly, nearly decorrelated, which allows hidden Markov models (HMMs) to use diagonal covariances effectively. Therefore, they usually provide higher baseline performance than filter bank features. Applying missing data techniques to cepstral features is thus attractive and natural.
Unfortunately, while decorrelating, the cepstral transform also smears localized spectral uncertainty into global cepstral uncertainty. This defect not only makes the unreliable cepstral components difficult to detect but also seems to contradict the basic assumption of missing data theory that some part of the feature vector should be untainted by the noise [4]. However, when the distortions are not too severe, some cepstral components will be less affected and can still provide correct discrimination information with the clean speech models. If we regard these components as reliable data, then the marginalisation approach can also be applied to the cepstral features. Its performance will depend on how severely the noise distorts the cepstral feature. Fortunately, although full band features, which smear distortions over the entire vector, are much more affected by band-limited noises than features that localize the spectral distortions, they do perform well in many full band noises. This phenomenon is also reported in [7,8,9]. It means that in many cases, the full band features are no more affected by the noise than the subband ones. Therefore, the cepstral marginalisation can be expected to perform well in such situations.
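The smearing effect can be illustrated with a small numerical sketch. It assumes an orthonormal-style DCT-II as the cepstral transform and uses made-up log filter bank energies; the helper names `dct_rows` and `cepstrum` are hypothetical and are not taken from the paper or from HTK.

```python
import math

def dct_rows(num_ceps, num_channels):
    """Rows a[i][j] of a DCT-II mapping log filter bank outputs to cepstra i = 1..num_ceps."""
    scale = math.sqrt(2.0 / num_channels)
    return [[scale * math.cos(math.pi * i * (j + 0.5) / num_channels)
             for j in range(num_channels)]
            for i in range(1, num_ceps + 1)]

def cepstrum(log_fbank, a):
    """Apply the DCT rows to one frame of log filter bank energies."""
    return [sum(a_ij * m for a_ij, m in zip(row, log_fbank)) for row in a]

J, I = 20, 12                                  # channel and cepstrum sizes used in the experiments
a = dct_rows(I, J)
clean = [math.log(1.0 + j) for j in range(J)]  # hypothetical clean log energies
noisy = list(clean)
noisy[6] += 2.0                                # distortion confined to a single channel

# Difference between noisy and clean cepstra, component by component.
shift = [cy - cx for cy, cx in zip(cepstrum(noisy, a), cepstrum(clean, a))]
affected = sum(1 for d in shift if abs(d) > 1e-6)
print(f"{affected} of {I} cepstral components changed")
```

In this toy case a distortion in one filter bank channel moves every cepstral component, but by different amounts, which is the property the reliability detection relies on.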
To implement the cepstral marginalisation approach, we propose a new technique to evaluate the reliability of each feature component in the Mel-cepstrum domain. Two criteria for detecting the reliable cepstral components are presented and combined together to form a more accurate joint decision. Then the marginalisation approach is applied to the MFCCs by using this combined criterion. Based on the proposed cepstral marginalisation approach, a cepstral soft decision approach is also developed to further improve the robustness of the MFCC recognizer.

Detection of the reliable cepstral features
The major difficulty of the cepstral marginalisation is how to determine the reliable/unreliable components of the speech data. In this paper, we propose two ways to estimate the influence of noises on the cepstral component. One is based on the speech enhancement and the other is based on a noise mask model. By setting the threshold, a criterion for selecting the reliable data can be obtained from each method. After that, we combine these two criteria together and propose a soft technique to determine the final reliable/unreliable decision for each cepstral component.
Assume that the noise is added in the time domain. Let $c_y(i)$ and $c_x(i)$ denote the $i$th MFCC components of the noisy speech and the clean speech, respectively, where $1 \le i \le I$ and $I$ is the dimension of the MFCC vector. Then $c_y(i)$ can be expressed as follows:

$$c_y(i) = c_x(i) + c_n(i), \tag{1}$$

where $c_n(i)$ can be viewed as the noise in the cepstrum domain. If $c_n(i)$ can be estimated, then the impact of the noise on the clean feature can also be determined. Let $Y(j)$, $X(j)$, and $N(j)$ denote the $j$th filter bank outputs of the linear power spectra of the noisy speech, clean speech, and noise, respectively. Then $c_n(i)$ can be expressed as

$$c_n(i) = \sum_{j=0}^{J-1} a_{ij} \log \frac{Y(j)}{X(j)}, \tag{2}$$

where $0 \le j \le J-1$, $J$ is the number of filter bank channels, and $a_{ij}$ are the DCT coefficients. Using an enhancement technique such as spectral subtraction, $X(j)$ can be estimated, so the estimate of $c_n(i)$ is given by

$$\hat{c}_n(i) = \sum_{j=0}^{J-1} a_{ij} \log \frac{Y(j)}{\hat{X}(j)}, \tag{3}$$

where $\hat{X}(j)$ and $\hat{c}_n(i)$ denote the estimates of $X(j)$ and $c_n(i)$. When $\hat{c}_n(i)$ is larger than a given threshold, $c_y(i)$ can be regarded as unreliable. This thresholding is the first criterion, (4), for choosing a reliable component.

Obviously, speech enhancement algorithms cannot always give accurate estimates of the clean features, especially when the SNR is low. An unreliable component with a small $\hat{c}_n(i)$, caused by the inaccuracy of the enhancement, cannot be detected using (4). To overcome this defect, we propose another method to estimate the influence of the noise. For additive noise, $c_y(i)$ can be expressed as

$$c_y(i) = \sum_{j=0}^{J-1} a_{ij} \log\bigl(X(j) + N(j)\bigr). \tag{5}$$

Assume that either the clean speech or the noise dominates in each filter bank channel and that the channel output can be approximated by the dominating one. For each channel, a threshold can be applied to determine which signal is dominating. In (6), $Y(j)$ is then approximated by $X(j)$ or by $\hat{N}(j)$ accordingly, where $\hat{N}(j)$ is the estimate of the noise and $\alpha$ is an empirical threshold factor determined experimentally.

Substituting (6) in (5) gives (7), in which the speech-dominated and the noise-dominated channels contribute separately to $c_y(i)$. According to (7), another criterion for choosing the reliable components, (8), can be obtained. Combining (8) with (4), the unreliable components with a small $\hat{c}_n(i)$ can also be detected. A joint decision is more accurate than an individual one. We simply adopt an "and" operation to form it: a component is considered reliable only when conditions (4) and (8) are both satisfied.
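The joint decision can be sketched as follows. The exact threshold forms of (4) and (8) are not reproduced here, so this sketch makes several assumptions: both criteria are plain magnitude thresholds (`beta1`, `beta2`), a channel counts as noise-dominated when $Y(j) \le \alpha \hat{N}(j)$, and the second criterion measures the cepstral contribution of the noise-dominated channels. All function names are hypothetical.

```python
import math

def dct_rows(num_ceps, num_channels):
    """Rows a[i][j] of a DCT-II for cepstra i = 1..num_ceps (assumed transform)."""
    scale = math.sqrt(2.0 / num_channels)
    return [[scale * math.cos(math.pi * i * (j + 0.5) / num_channels)
             for j in range(num_channels)]
            for i in range(1, num_ceps + 1)]

def enhancement_noise(Y, X_hat, a):
    """Estimated cepstral noise: sum_j a_ij * log(Y(j) / X_hat(j))."""
    return [sum(a_ij * math.log(y / x) for a_ij, y, x in zip(row, Y, X_hat))
            for row in a]

def mask_noise(Y, N_hat, a, alpha=1.0):
    """Cepstral contribution of channels assumed noise-dominated (Y <= alpha * N_hat)."""
    return [sum(a_ij * math.log(n)
                for a_ij, y, n in zip(row, Y, N_hat) if y <= alpha * n)
            for row in a]

def reliable_components(Y, X_hat, N_hat, a, beta1, beta2, alpha=1.0):
    """Joint decision: a component is reliable only if both criteria accept it."""
    crit1 = [abs(v) < beta1 for v in enhancement_noise(Y, X_hat, a)]
    crit2 = [abs(v) < beta2 for v in mask_noise(Y, N_hat, a, alpha)]
    return [r1 and r2 for r1, r2 in zip(crit1, crit2)]
```

The inputs are linear filter bank powers for one frame and must be strictly positive, which a floored spectral subtraction guarantees for `X_hat`.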

Detection of the reliable delta cepstral coefficients
In traditional ASR systems, time derivatives are usually appended to the static parameters to improve recognizer performance. The marginalisation approach can also be applied to these coefficients. In filter bank marginalisation, one solution to this problem is called the "strict mask" [10]. It treats a derivative as missing if any of the features involved in its calculation are missing. The strict mask is sufficient for filter bank features because the reliable features tend to be clustered into time-frequency blocks. However, it may not be feasible for cepstral features, since the missing mask pattern is more random, and applying the strict mask would leave too few reliable derivatives. We therefore propose another way to detect the reliable derivatives. It is also based on the combination of the enhancement and noise mask methods described in Section 2.1.
Usually, the delta coefficients are calculated from the static coefficients of neighbouring frames using a fixed expression, (9). The noise of the delta cepstral coefficients, $\Delta c_n(i)$, is expressed in (10). When the cepstral noise $\hat{c}_n(i)$ is estimated using the enhancement, $\Delta\hat{c}_n(i)$ is given by (11), which yields one criterion, (12), for choosing the reliable delta cepstral components. On the other hand, with the noise mask approximation, $\Delta c_y(i)$ can be expressed as in (13), which yields another criterion, (14). Combining these two criteria, a delta cepstral component is judged reliable when conditions (12) and (14) are both satisfied.
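For comparison, the strict mask rule can be sketched as below. The sketch assumes that the delta at frame t is computed from frames t-1..t-window and t+1..t+window, with indices clamped at the utterance edges; the window length and the edge handling are assumptions, and `strict_delta_mask` is a hypothetical name.

```python
def strict_delta_mask(static_mask, window=2):
    """static_mask[t][i] is True when static component i of frame t is reliable.

    A delta is marked reliable only if every static feature entering its
    computation is reliable (the "strict mask" rule).
    """
    num_frames = len(static_mask)
    num_comps = len(static_mask[0]) if num_frames else 0
    delta_mask = []
    for t in range(num_frames):
        # Frames involved in the delta at t, clamped to the utterance boundaries.
        involved = set()
        for th in range(1, window + 1):
            involved.add(min(num_frames - 1, t + th))
            involved.add(max(0, t - th))
        delta_mask.append([all(static_mask[k][i] for k in involved)
                           for i in range(num_comps)])
    return delta_mask
```

A single unreliable static value removes the deltas of up to `2 * window` neighbouring frames, which shows how a scattered cepstral mask quickly leaves few usable derivatives.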

Marginalisation
Using (4), (8), (12), and (14), the reliable cepstral and delta cepstral components can be picked out from the whole feature vector. For a continuous density HMM (CDHMM) recognition system with diagonal covariances, the marginalised probability of the observations is given by (15), where $x$ is the observation vector, $C_m$ is the $m$th state of the HMM, $w_{mn}$ is the weight associated with the $n$th Gaussian component of state $C_m$, and $\mu_{mn}$ and $\sigma^2_{mn}$ are the mean and variance of the Gaussian PDF.
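A minimal sketch of a marginalised state likelihood is given below. It assumes the simplest form of marginalisation, in which unreliable components are dropped from the product over dimensions of each diagonal Gaussian; whether (15) also bounds the unreliable components is not assumed here. The function names are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def marginal_log_likelihood(x, reliable, weights, means, variances):
    """log p(x | C_m) for a diagonal GMM, using only components flagged reliable."""
    terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = sum(log_gauss(xi, m, v)
                 for xi, m, v, r in zip(x, mu, var, reliable) if r)
        terms.append(math.log(w) + ll)
    peak = max(terms)  # log-sum-exp over mixture components for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

With all components flagged reliable this reduces to the usual state likelihood, so the same routine serves the baseline and the marginalised decoder.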

Noisy speech model
Due to the cepstral transformation, even a little noise in some frequency bands will affect all the feature components. So, in a noisy environment, each cepstral component will always carry some portion of the noise.
Obviously, it is more sensible to adjust the weight of each component according to how strongly it is affected than to use a binary decision of reliable or unreliable. Given a noisy observation, the components that are less affected by the noise will have distributions close to the clean ones, while those severely affected will be more uncertain and might have very different characteristics. According to [4], the distribution of a noisy observation can be modeled as a weighted sum of a known distribution obtained during the training process and an unknown distribution for the uncertain data. We model the noisy speech in a similar way. With diagonal covariances, the probability of a noisy observation is given by (16), where $p_1(x_i \mid C_m, n)$ denotes the clean distribution, defined in (17), and $p_2(x_i)$ denotes the distribution of uncertain data.
When no prior knowledge about this distribution is available, the uncertain data can be assumed to be uniformly distributed over the range of values observed during training:

$$p_2(x_i) = \frac{1}{x_{i,\max} - x_{i,\min}}, \tag{18}$$

where $x_{i,\max}$ and $x_{i,\min}$ are the maximum and minimum values of the $i$th component observed in the training data.
In the acoustic backing-off approach, $\varepsilon_i$ refers to the prior probability of observing a reliable datum and needs to be determined in advance. Obviously, this assumption is not suitable for real-world applications. Instead of setting a static value in advance, we adjust $\varepsilon_i$ according to the noise level of each cepstral component. These levels are estimated using the two methods described in Section 2.
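The soft decision likelihood can be sketched as follows, assuming that each dimension mixes the clean Gaussian and the uniform density with weights $\varepsilon_i$ and $1 - \varepsilon_i$, and that the mixture is formed inside the product over dimensions of each Gaussian component. The uniform density is applied without a range check, and a tiny floor guards the logarithm; both are simplifications of this sketch. The function names are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def soft_log_likelihood(x, eps, weights, means, variances, x_min, x_max):
    """log p(x | C_m) with per-component reliability weights eps[i] in [0, 1]."""
    terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = 0.0
        for xi, m, v, e, lo, hi in zip(x, mu, var, eps, x_min, x_max):
            p_clean = math.exp(log_gauss(xi, m, v))  # trained clean distribution
            p_unif = 1.0 / (hi - lo)                 # uniform over the training range
            mix = e * p_clean + (1.0 - e) * p_unif
            ll += math.log(max(mix, 1e-300))         # floor avoids log(0) on underflow
        terms.append(math.log(w) + ll)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Setting every weight to 1 recovers the standard likelihood, while a weight of 0 makes that dimension contribute the same constant to every state, so it no longer discriminates between models.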

Weights adjustment
Let $\varepsilon_i$ denote the weight for the $i$th cepstral component, with a corresponding weight for the $i$th delta cepstral component. Using the enhancement method, the weights are adjusted by (19); using the noise mask method, they are adjusted by (20). The two sets of weights can also be combined to improve the performance, and the combined weights are calculated by (21).

EXPERIMENTS
Clean speech data for training and testing are taken from the TI46 speaker-dependent isolated word corpus. Digits 0-9 spoken by all male speakers are used. There are 26 utterances of each digit from each speaker: 10 of these utterances are designated as training tokens and the other 16 as testing tokens. Speech data are sampled at 12500 Hz and linearly quantized with 12 bits. Four noises with distinct characteristics from the NOISEX-92 [11] database (white noise, F16 noise, pink noise, and factory noise) are artificially added to the clean speech at different SNRs. Each digit is modeled by an HMM that consists of five no-skip, straight-through emitting states. Each state has three Gaussian mixture components with diagonal covariances. Both filter bank coefficients and MFCCs are used in the experiments. Input speech is segmented into overlapping frames of 25 milliseconds length with a 10 milliseconds shift. Twenty triangular filters are uniformly distributed on a Mel-frequency scale, and their log energy outputs form the 20-dimensional filter bank coefficients. Twelve MFCCs are computed by applying the DCT to these filter bank coefficients. The delta coefficients are computed and appended to the basic acoustic vectors in the front-end. We use the HTK tools 3.0 [12] for both the feature extraction and the HMM training.
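Adding noise at a prescribed SNR can be sketched as below. The sketch assumes a global SNR measured over the whole utterance and takes the noise segment from the start of the noise recording; how the SNR was actually measured and how segments were chosen is not stated above, and `mix_at_snr` is a hypothetical name.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the mixture has the requested global SNR in dB."""
    segment = noise[:len(clean)]  # assumes the noise is at least as long as the speech
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in segment) / len(segment)
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(clean, segment)]
```

A global SNR treats silent and voiced frames alike, so the local SNR of individual frames can differ considerably from the nominal value.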

Evaluation of the proposed approaches
The performance of the proposed approaches is evaluated with the four types of noise. For the cepstral marginalisation and soft decision approaches, a simple nonadaptive linear spectral subtraction, (22), is employed as an enhancement preprocess, where $\lambda$ is the flooring factor, set to 0.05 in the experiments. The first 20 frames of each noisy utterance are assumed to contain only noise, and their average power spectra are used to estimate $\hat{N}(j)$. We empirically set $\alpha$, $\beta_1$-$\beta_4$, and $\gamma_1$-$\gamma_4$ to 1.0. The HTK recognition process is modified according to (15) and (16) to implement the marginalisation and soft decision approaches. Table 1 shows the average recognition rates of the baseline MFCC recognizer and the proposed approaches. For comparison, the results of spectral subtraction (SS), cepstral mean subtraction (CMS), and filter bank marginalisation with the SNR criterion plus strict mask are also listed in the table. Here, "MG" refers to marginalisation and "SD" refers to soft decision.
Both SS and CMS improve on the baseline performance. CMS is less effective for additive noises than SS, probably because CMS is mainly designed to cope with stationary convolutional distortions. Both the proposed approaches and the filter bank marginalisation show significant improvements over these two techniques. Compared with the filter bank marginalisation, the cepstral marginalisation gives higher recognition rates on average over the four types of noise: it is worse for the white noise, slightly better for the F16 noise and pink noise, and significantly better for the factory noise. The cepstral SD approach is superior to both marginalisation approaches for all types of noise. These results confirm our prediction that the cepstral marginalisation can work well for many kinds of full band noise, and also show the effectiveness of the SD approach.
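The enhancement preprocess can be sketched as follows. The noise estimate averages the first frames as described; the floor is assumed to be relative to the noisy power, $\lambda Y(j)$, which is one common choice and may differ from the form used in (22). The function names are hypothetical.

```python
def estimate_noise(power_frames, num_noise_frames=20):
    """Average the leading frames, which are assumed to contain noise only."""
    head = power_frames[:num_noise_frames]
    return [sum(frame[j] for frame in head) / len(head)
            for j in range(len(head[0]))]

def spectral_subtraction(frame, noise, floor=0.05):
    """Subtract the noise power per channel, flooring at `floor` times the noisy power."""
    return [max(y - n, floor * y) for y, n in zip(frame, noise)]
```

The floor keeps every enhanced power strictly positive, which matters because the enhanced values are subsequently passed through a logarithm.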

Combination of the criteria and the weights
To show the effectiveness of our combined criteria for the cepstral marginalisation, Table 2a lists the average recognition rates of different criteria for the four types of noises.
Here, criterion 1 refers to the criteria shown in (4) and (12), criterion 2 to those in (8) and (14), and the combined criterion to (4), (8), (12), and (14) together. The results of the SS are also listed in the table.
It can be seen that the recognition rates improve whenever the marginalisation approach is applied, whether with criterion 1, criterion 2, or the combined criterion. Of the individual criteria, criterion 1 gives better performance than criterion 2. This is probably because criterion 1 is more closely related to the enhancement preprocess. Nevertheless, the combined criterion achieves the highest recognition rates. Thus, it can be concluded that the joint decision is more accurate than the individual ones.
The average recognition rates of the cepstral SD approach with the individual and combined weights are listed in Table 2b. Here, weight 1 is derived from (19), weight 2 is derived from (20), and the combined weight refers to (21). As the combined criterion does for the cepstral marginalisation, the combined weight gives the best performance for the cepstral SD approach.

Influence of different types of noises to the cepstral feature
One of the major factors that affect the performance of the marginalisation and SD approaches is how severely the noises distort the features. If we consider the effect of cepstral distortions to be additive, the normalized mean square error (NMSE) can be used to evaluate the distortion level of a cepstral component [13]. To show the impact of different types of noise on the MFCCs, we compute the NMSE between the corresponding components of the clean and noisy MFCCs when the SNR is 10 dB. The results are listed in Table 3.
As can be seen, the four types of full band noise distort all the MFCC components. For the white noise and pink noise, C1 is the most affected. For the F16 noise, C9 and C10 are much more affected than the other components. Obviously, additive noise in the time domain causes the signal to be distorted in the cepstrum domain. The level of distortion depends on both the noise level and the clean speech. The results in Table 3 show the trend that noises with flat spectra distort the lowest cepstral component most, whereas noises whose energy is concentrated in some frequency bands distort particular cepstral components. Due to the nonstationary nature of factory noise, it is hard to analyse its impact through the NMSE, but the result shows that C1 and the higher-order coefficients are more affected. Among the four types of noise, the NMSE of white noise is the largest. This explains why the cepstral marginalisation approach performs worse under the white noise condition.
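A per-component NMSE can be sketched as below. The normalisation is assumed to be the energy of the clean component summed over frames; the definition in [13] may normalise differently, and `nmse_per_component` is a hypothetical name.

```python
def nmse_per_component(clean, noisy):
    """NMSE of each cepstral component over time-aligned clean and noisy frames.

    clean[t][i] and noisy[t][i] hold component i of frame t.
    """
    num_comps = len(clean[0])
    result = []
    for i in range(num_comps):
        err = sum((y[i] - x[i]) ** 2 for x, y in zip(clean, noisy))
        ref = sum(x[i] ** 2 for x in clean)  # assumed normaliser: clean energy
        result.append(err / ref)
    return result
```

Normalising by the clean energy makes components with small dynamic range, typically the higher-order ones, comparable with the large low-order components.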

CONCLUSION
In this paper, we propose new cepstral marginalisation and cepstral soft decision approaches for MFCCs.
Experiments on the TI46 speaker-dependent isolated word corpus with four types of noise from the NOISEX-92 database show that the proposed approaches can efficiently improve the performance of the MFCC recognizer and give higher average recognition rates than the filter bank marginalisation. They also show that the marginalisation approach, applied to cepstral features rather than filter bank representations, can perform well when these features are not too severely affected by the environmental noise. The cepstral soft decision approach gives the best performance in the experiments. It is believed that further improvement can be gained when the weights are determined in a more precise manner.