Model Compensation Approach Based on Nonuniform Spectral Compression Features for Noisy Speech Recognition

This paper presents a novel model compensation (MC) method for mel-frequency cepstral coefficient (MFCC) features with signal-to-noise-ratio (SNR)-dependent nonuniform spectral compression (SNSC). Though these new MFCCs derived from an SNSC scheme have been shown to be robust features in the matched case, they suffer from serious mismatch when the reference models are trained at different SNRs and in different environments. To overcome this drawback, a compressed mismatch function is defined for the static observations with nonuniform spectral compression. The means and variances of the static features with spectral compression are derived according to this mismatch function. Experimental results show that the proposed method provides better recognition accuracy than conventional MC methods using uncompressed features, especially at very low SNR under different noises. Moreover, the new compensation method has a computational complexity only slightly above that of conventional MC methods.


INTRODUCTION
The problem of achieving robust speech recognition in noisy environments has attracted much interest in the past decades. However, drastic degradation of performance may still occur when a recognizer operates in noisy conditions. Solutions to this problem can generally be divided into three categories: inherently robust feature representation [1], speech enhancement schemes [2], and model-based compensation [3][4][5][6]. More details are reviewed in [7]. Recently, different speech analyses based on psychoacoustics have been reported in the literature [8]. The well-known perceptual linear prediction (PLP) [9] uses critical band filtering followed by equal-loudness pre-emphasis to simulate, respectively, the frequency resolution and frequency sensitivity of the auditory system. Cubic-root spectral magnitude compression with a fixed compression root is subsequently used to approximate the intensity-to-loudness conversion. However, it is suboptimal to use a constant root for compressing all the filter bank outputs, because a constant compression root over-compresses some outputs and under-compresses others at the same time.
A new kind of noise-resistant feature employing an SNR-dependent nonuniform spectral compression scheme was presented in [1], which compresses the corrupted speech spectrum by an SNR-dependent root value. It was shown in [1] that the SNSC-derived mel-frequency cepstral coefficient (SNSC-MFCC) features provide better recognition accuracy than the conventional MFCC features and cubic-root compressed features. In an SNSC scheme, the compressed speech spectrum in the linear-spectral domain, $\hat{Y}_k$, is expressed as

$$\hat{Y}_k = Y_k^{\alpha_k}, \qquad (1)$$

where $Y_k$ is the kth mel-scale filter bank output of a corrupted speech segment and $\alpha_k$ is the compression root for the kth filter band, which is SNR-dependent. However, since $\alpha_k$ is SNR-dependent, an estimate of the noise is required in the training session to find $\alpha_k$ under a particular noise type and global SNR. Thus models trained in this way should only be used for a recognition task under the same global SNR and noise environment. To avoid reestimating the models when adopting an SNSC scheme, we need to compensate the models for the mismatch caused by the compression root.

EURASIP Journal on Advances in Signal Processing

This paper presents a compensation scheme that adapts recognition models trained with clean, uncompressed training data to SNSC-MFCC features in various noisy environments. In this scheme, we start by using a conventional MC method, such as the parallel model combination (PMC) method [3,4] or the vector Taylor series (VTS) approach [6], to produce compensated models for uncompressed features. The means and variances of the compressed mismatch function are then derived. With the use of Gauss-Hermite numerical integration [10], a model compensation procedure is developed. Most importantly, the new compensation scheme is applicable to any conventional model compensation method.
We call the new scheme the model compensation approach based on SNR-dependent nonuniform spectral compression (MC-SNSC). The experimental results show that the compensated models provide very good accuracy in recognizing SNSC-MFCC features at different SNRs in different noisy environments. The computational complexity of the proposed MC-SNSC method is comparable with that of conventional MC methods.
The structure of this paper is as follows. The SNSC method is briefly reviewed in Section 2. Section 3 introduces the MC-SNSC approach. A series of experimental results, along with discussion and analysis, is presented in Section 4. Conclusions are given in the final section.

SNR-DEPENDENT NONUNIFORM SPECTRAL COMPRESSION
The functional diagram of the generation of SNSC-MFCC features is depicted in Figure 1. The testing utterance is segmented into frames using a Hamming window. The frequency spectra of the speech segments are computed via the discrete Fourier transform (DFT). Their squared magnitude spectra are passed to the mel-scaled filter bank. After the mel-scaled bandpass filtering, the spectral compression is applied to the outputs as in (1). Taking the log of the compressed outputs and then the discrete cosine transform (DCT), we obtain the SNSC-MFCC features. To simulate the spectrally partial masking effect, the compression function $\alpha_k$ is defined in (2) in terms of the band SNR, where $A_0$ is the floor compression root, $\beta$ is the cutoff parameter that functions as the just-audible threshold, $\gamma$ is the parameter that controls the steepness of the compression function, and $u(\cdot)$ is the unit step function. For SNR less than the cutoff, (2) yields the floor compression value. The compression function produces small $\alpha_k$ at a steep rate of change for small band SNR above the cutoff, and large $\alpha_k$ asymptotically close to one at a gradual rate for large band SNR. This SNSC scheme makes the filter bank outputs of low SNR contribute less to the resulting speech features, while the outputs of high SNR are largely emphasized.

Figure 1: Functional diagram of SNSC-MFCC feature generation. Block labels: windowed noisy speech signal y(n); P(i); mel-scaled band-pass filter; filter-bank output energies of the noise estimate; log followed by DCT; SNSC-derived MFCC (static feature).

The mismatch function $Y_k$ of the kth mel-filter bank output, modeled as the sum of the noise energy $N_k$ and the clean speech energy $X_k$ in the linear-spectral domain, is expressed as

$$Y_k = X_k + N_k. \qquad (3)$$

Defining the clean speech and the noise in the log-spectral domain as $X^{(l)}_k$ and $N^{(l)}_k$, respectively, the mismatch function in the log-spectral domain is expressed as

$$Y^{(l)}_k = \log\left(e^{X^{(l)}_k} + e^{N^{(l)}_k}\right). \qquad (4)$$

Thus the compressed mismatch function for the SNSC in the log-spectral domain, $\hat{Y}^{(l)}_k$ (the log of the compressed filter bank output), is expressed as

$$\hat{Y}^{(l)}_k = \alpha_k Y^{(l)}_k, \qquad (5)$$

where the compression root $\alpha_k$ is given by (6).

In this paper, we make the following assumptions in order to facilitate the derivations of the MC procedures.
(1) The recognition model is a standard HMM with Gaussian mixture output probability distributions. The transition probabilities and mixture component weights of the models are assumed to be unaffected by the additive noise.
(2) The background noise is additive, stationary, and independent of the speech.
The notation for the variables used in the paper is as follows. The superscript (l) denotes a quantity in the log-spectral domain. Figure 2 shows the functional diagram of the recognition system using model compensation for SNSC-MFCC features.
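The feature pipeline and the log-spectral mismatch function reviewed in this section can be illustrated with a minimal Python sketch. It assumes that the mel filter-bank energies and the per-band compression roots are already available; the function names (`dct_ii`, `snsc_mfcc`, `log_mismatch`) and the numerical inputs are hypothetical and chosen only for illustration. The compression roots are passed in directly, because the sketch does not attempt to reproduce the compression function of (2).

```python
import math

def dct_ii(values, num_coeffs):
    """Unnormalised DCT-II of a sequence, keeping the first num_coeffs terms."""
    m = len(values)
    return [
        sum(v * math.cos(math.pi * c * (j + 0.5) / m) for j, v in enumerate(values))
        for c in range(num_coeffs)
    ]

def snsc_mfcc(fbank_energies, alphas, num_coeffs=13):
    """Compress each filter-bank energy by its root, then take log and DCT.

    log(Y_k ** alpha_k) equals alpha_k * log(Y_k), so root compression is a
    per-band scaling in the log-spectral domain.
    """
    compressed_log = [a * math.log(y) for y, a in zip(fbank_energies, alphas)]
    return dct_ii(compressed_log, num_coeffs)

def log_mismatch(x_log, n_log):
    """Log-spectral mismatch log(exp(x) + exp(n)), evaluated stably."""
    hi, lo = max(x_log, n_log), min(x_log, n_log)
    return hi + math.log1p(math.exp(lo - hi))

# Hypothetical filter-bank energies and compression roots (illustration only).
energies = [4.0, 2.5, 1.2, 0.8, 0.3, 0.1]
roots = [1.0, 0.95, 0.9, 0.85, 0.8, 0.75]
cepstra = snsc_mfcc(energies, roots, num_coeffs=4)
```

With all roots set to one, `snsc_mfcc` reduces to the ordinary log-plus-DCT computation, which corresponds to the uncompressed features used for training the clean models.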

MODEL COMPENSATION APPROACH BASED ON THE SNSC SCHEME
In the training phase, clean speech HMMs are trained from standard MFCC features, to which no compression is applied (equivalently, the compression root equals one). During feature extraction in the testing phase, the SNSC scheme described in (1) is used to compress each filter bank output. The clean HMMs are combined with the noise model to construct the corrupted speech models, which are used to recognize the SNSC-MFCC features under the MC-SNSC approach.
There are no closed-form solutions for the moments of the mismatch function in (5) and (6). The expectations are multidimensional integrals for which we need to use computationally expensive numerical integrations to calculate the model parameters. With the use of assumption (2) and an additional assumption that the two random variables Y (l) k and N (l) k are uncorrelated, we can reduce the dimensionality of the integration. Using the Gauss-Hermite numerical integral method, we derive the procedures for computing the means and variances of the static features in the log-spectral domain in the next subsections.
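As an illustration of the Gauss-Hermite numerical integration used here, the following Python sketch approximates the expectation of a function of a Gaussian random variable with a 4-point rule. The abscissas and weights are obtained in closed form from the roots of $H_4(t) = 16t^4 - 48t^2 + 12$; the function names are hypothetical, and the example integrand is chosen only because its expectation is known exactly.

```python
import math

def gauss_hermite_4():
    """Abscissas and weights of the 4-point Gauss-Hermite rule (weight exp(-t*t))."""
    root6 = math.sqrt(6.0)
    nodes, weights = [], []
    for s in (3.0 - root6, 3.0 + root6):   # s = 2 * t**2 at the roots of H_4
        t = math.sqrt(s / 2.0)
        w = math.sqrt(math.pi) / (4.0 * s)
        nodes += [-t, t]
        weights += [w, w]
    return nodes, weights

def gaussian_expectation(f, mean, var):
    """Approximate E[f(X)] for X ~ N(mean, var) with the 4-point rule."""
    nodes, weights = gauss_hermite_4()
    scale = math.sqrt(2.0 * var)           # change of variable x = mean + scale * t
    total = sum(w * f(mean + scale * t) for t, w in zip(nodes, weights))
    return total / math.sqrt(math.pi)

# Check against a case with a known answer: E[exp(X)] = exp(mean + var / 2).
approx = gaussian_expectation(math.exp, 0.3, 0.25)
exact = math.exp(0.3 + 0.25 / 2.0)
```

A 4-point rule integrates polynomials up to degree seven exactly, so smooth integrands with moderate variance are approximated closely with only four function evaluations per dimension.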

Mean compensation
Using the compressed mismatch function described in (5), the mean of the static SNSC-MFCC feature in the log-spectral domain is given by the expectation of the compressed feature, $E\{\alpha_k Y^{(l)}_k\}$. For the sake of simplifying the expression, auxiliary quantities are defined, and the mean parameters of the static corrupted and compressed features are then expressed in terms of them. Using the Gauss-Hermite integral, $g(\gamma)$ is calculated as a weighted sum over the quadrature abscissas; the expression involves the term $(1/\gamma)\,\Sigma^{(l)}_{Y_{kk}}$ and the error function $\mathrm{erf}(\cdot)$. The parameters $t_i$ and $\omega_i$, for $i = 1$ to $n$, are, respectively, the abscissas and the weights of the nth-order Hermite polynomial $H_n(t)$ [10].

Variance compensation
The diagonal elements of the covariance matrix of the SNSC-MFCC static features are given by the variance of the compressed feature, that is, its second moment minus the square of the compensated mean. The computation of the off-diagonal elements of the covariance matrix of the static models involves two-dimensional Gauss-Hermite numerical integrals. To reduce the computational complexity, the off-diagonal elements are approximated using a scaling factor $\lambda_{lk}$, which is defined so as to ensure that the off-diagonal elements are smaller than the corresponding diagonal elements.
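The per-band mean and diagonal variance of a compressed feature can be approximated numerically along the following lines. This is only a sketch under simplifying assumptions, not the derivation of the paper: the compression root is treated as a function of the log-spectral feature alone (the noise level is held fixed), and `toy_alpha` is a hypothetical saturating function that merely mimics the qualitative behavior described for the compression function (a floor below a cutoff, approaching one above it). A 4-point Gauss-Hermite rule is used.

```python
import math

SQRT_PI = math.sqrt(math.pi)
_ROOT6 = math.sqrt(6.0)
# 4-point Gauss-Hermite rule: (abscissa, weight) pairs for the weight exp(-t*t).
GH4 = [(sign * math.sqrt(s / 2.0), SQRT_PI / (4.0 * s))
       for s in (3.0 - _ROOT6, 3.0 + _ROOT6) for sign in (-1.0, 1.0)]

def compressed_moments(alpha_fn, mean_y, var_y):
    """Mean and variance of alpha_fn(y) * y for y ~ N(mean_y, var_y)."""
    scale = math.sqrt(2.0 * var_y)
    first = second = 0.0
    for t, w in GH4:
        y = mean_y + scale * t
        z = alpha_fn(y) * y                # compressed log-spectral feature
        first += w * z
        second += w * z * z
    first /= SQRT_PI
    second /= SQRT_PI
    return first, second - first * first

def toy_alpha(y, floor=0.75, cutoff=0.0, steepness=1.0):
    """HYPOTHETICAL compression root, not the paper's equation (2)."""
    if y <= cutoff:
        return floor
    return floor + (1.0 - floor) * (1.0 - math.exp(-(y - cutoff) / steepness))

# Constant root: the moments are alpha * mean and alpha**2 * variance.
mean_const, var_const = compressed_moments(lambda y: 0.5, 2.0, 0.4)
# Feature-dependent root (hypothetical function above).
mean_toy, var_toy = compressed_moments(toy_alpha, 2.0, 0.4)
```

The constant-root case serves as a sanity check, since its moments are known in closed form; a feature-dependent root is what makes numerical integration necessary.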

Corrupted models of noncompressed features
The above MC-SNSC procedures need the compensated static models of noncompressed corrupted speech in the log-spectral domain, $\{\mu^{(l)}_{Y_k}, \Sigma^{(l)}_{Y_{kl}}\}$. They can be obtained from any conventional model-based compensation method, such as the PMC method [3,4] or the vector Taylor series (VTS) method [6].
In the log-normal PMC method, the kth elements of the mean vectors and the (k, l)th elements of the covariance matrices of the clean speech models in the linear-spectral domain are related to those in the log-spectral domain as

$$\mu_{X_k} = e^{\mu^{(l)}_{X_k} + \Sigma^{(l)}_{X_{kk}}/2}, \qquad \Sigma_{X_{kl}} = \mu_{X_k}\,\mu_{X_l}\left(e^{\Sigma^{(l)}_{X_{kl}}} - 1\right).$$

In the linear-spectral domain, the noise is assumed to be additive and independent of the speech. The corrupted speech model parameters in this domain are obtained by combining the clean speech models and the noise model as

$$\mu_{Y_k} = \mu_{X_k} + \mu_{N_k}, \qquad \Sigma_{Y_{kl}} = \Sigma_{X_{kl}} + \Sigma_{N_{kl}},$$

and are mapped back to the log-spectral domain by inverting the log-normal relations above.

For the log-add PMC, the mean compensation is described as

$$\mu^{(l)}_{Y_k} = \log\left(e^{\mu^{(l)}_{X_k}} + e^{\mu^{(l)}_{N_k}}\right).$$

This method compensates only the mean, not the variance. It thus has low computational complexity. However, its performance becomes unsatisfactory at low SNR. This scheme can be viewed as the zeroth-order VTS (denoted as VTS-0). The VTS method approximates the mismatch function by a finite-length Taylor series, and the expectation of this Taylor series is taken to find the corrupted speech model parameters. A higher-order Taylor series can yield a better solution, but its computational cost is very high. Thus VTS-0 and first-order VTS (VTS-1) [6] are commonly employed. Using the VTS-1 method, the compensation of the mean is the same as for the log-add PMC, and the covariance is compensated as

$$\Sigma^{(l)}_{Y} = M\,\Sigma^{(l)}_{X}\,M^{T} + (I - M)\,\Sigma^{(l)}_{N}\,(I - M)^{T},$$

where M is the diagonal matrix whose elements are expressed as

$$M_{kk} = \frac{1}{1 + e^{\mu^{(l)}_{N_k} - \mu^{(l)}_{X_k}}}.$$

In summary, the MC-SNSC method uses the background noise model and the uncompressed corrupted-speech models to compute the compressed corrupted-speech models. The band-SNR-dependent SNSC is employed in this scheme to compress the features so as to emphasize the signal components of high SNR and de-emphasize the highly noisy ones. The compressed corrupted speech models are then used for recognizing the SNSC-compressed testing features.

Table note: For the Gauss-Hermite integral, n = 4 is employed. * Average WRR (%) between −5 and 5 dB.
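The three conventional compensation rules discussed in this subsection can be sketched for a single filter-bank band as follows. This is a minimal illustration under stated assumptions: each model is reduced to a scalar (mean, variance) pair in the log-spectral domain, no gain-matching term is included, and the function names are hypothetical. The formulas are the standard textbook forms of the log-normal PMC, the log-add PMC, and the first-order VTS.

```python
import math

def log_to_linear(mean_log, var_log):
    """Log-normal mapping of a log-domain Gaussian to linear-domain moments."""
    mean_lin = math.exp(mean_log + 0.5 * var_log)
    var_lin = mean_lin * mean_lin * (math.exp(var_log) - 1.0)
    return mean_lin, var_lin

def linear_to_log(mean_lin, var_lin):
    """Inverse of log_to_linear."""
    var_log = math.log(var_lin / (mean_lin * mean_lin) + 1.0)
    mean_log = math.log(mean_lin) - 0.5 * var_log
    return mean_log, var_log

def lognormal_pmc(speech, noise):
    """Log-normal PMC; speech and noise are (mean, variance) in the log domain."""
    mx, vx = log_to_linear(*speech)
    mn, vn = log_to_linear(*noise)
    return linear_to_log(mx + mn, vx + vn)   # additive, independent noise

def log_add_pmc(speech, noise):
    """Log-add PMC: only the mean is compensated; the clean variance is kept."""
    mean = math.log(math.exp(speech[0]) + math.exp(noise[0]))
    return mean, speech[1]

def vts1(speech, noise):
    """First-order VTS: log-add mean and linearised variance."""
    m = 1.0 / (1.0 + math.exp(noise[0] - speech[0]))
    mean = math.log(math.exp(speech[0]) + math.exp(noise[0]))
    var = m * m * speech[1] + (1.0 - m) ** 2 * noise[1]
    return mean, var

# Hypothetical clean-speech and noise statistics for one band.
clean, background = (1.0, 0.2), (0.5, 0.1)
compensated = lognormal_pmc(clean, background)
```

Any of the three functions could supply the uncompressed corrupted-speech statistics that the MC-SNSC step takes as input.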

EVALUATION
In this section, three noise types from the NOISEX-92 database are used in the evaluation experiments: white, pink, and factory noise. The speech database used for the evaluation of the MC-SNSC technique is the TI-20 database from Ti-Digits, which contains 20 isolated words: the digits "0" to "9" plus ten extra commands such as "help" and "repeat." The database was spoken by 16 speakers (8 males and 8 females), and we select 2 and 16 utterances for training and testing, respectively, from each speaker and each word (641 utterances for training and 5081 utterances for testing). The analysis frame (Hamming windowed) is 32 milliseconds long, and the frame shift is 9.6 milliseconds. The feature vector is composed of 13 static cepstral coefficients.
A word-based HMM with six states and four Gaussian mixture components per state is used as the reference model. In the training mode, we train the system with clean speech utterances to produce the clean models, and with corrupted speech to produce the models for the matched case. In testing, the ten speech recognition methods listed in Table 1 are used for the performance evaluation. These ten methods comprise two mismatched and two matched cases; three conventional model-based compensation methods, namely the log-normal PMC, the log-add PMC, and the first-order VTS (denoted as VTS-1); and these three conventional methods combined with the MC-SNSC method.
For our MC-SNSC approach, an average background noise power spectrum is needed to estimate the background noise model and to estimate the band SNR for calculating the SNSC-derived features in the testing phase. The average noise power spectrum is calculated from 200 nonoverlapping frames of noise data and is scaled according to a specified global SNR. The global SNR for an utterance, $\mathrm{SNR}_{\mathrm{global}}$, is defined in terms of the following quantities: $\{P_m(k)\}$ is the clean speech power spectrum of the mth frame, $\{N(k)\}$ is the nonscaled average noise power spectrum, O is the total number of frames in the utterance, Q is the FFT size, and g is the scaling factor that scales the ratio according to the specified $\mathrm{SNR}_{\mathrm{global}}$. The corrupted speech y(i) is then produced from the clean speech x(i) and the nonscaled noise signal n(i), with the noise scaled according to g. The recognition results are given in Table 2. For the MC-SNSC method, the parameters $(A_0, \beta, \gamma)$ are set empirically on the basis of extensive experiments. The method performs well when the parameters lie in the ranges $A_0 \in [0.7, 0.9]$, $\beta \in [-0.6, 0.6]$, and $\gamma \in [1, 2]$. In this work, we fix the parameter set as $A_0 = 0.75$, $\beta = -0.4$, and $\gamma = 1$.
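The idea of scaling the noise to reach a specified global SNR can be sketched as below. This is a simplified illustration, not the exact definition used in the paper: it assumes that the SNR is computed from time-domain energies of equal-length speech and noise sequences rather than from frame-averaged power spectra, and that the power scaling factor is applied to the noise amplitude through its square root. The function names and toy signals are hypothetical.

```python
import math

def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)

def noise_power_gain(speech, noise, snr_db):
    """Power scaling g such that energy(speech) / (g * energy(noise)) matches snr_db."""
    return energy(speech) / (energy(noise) * 10.0 ** (snr_db / 10.0))

def add_noise(speech, noise, snr_db):
    """Return speech plus noise, with the noise amplitude scaled by sqrt(g)."""
    amp = math.sqrt(noise_power_gain(speech, noise, snr_db))
    return [s + amp * n for s, n in zip(speech, noise)]

# Toy signals for illustration only (not data from the paper).
speech = [math.sin(2.0 * math.pi * 5.0 * i / 200.0) for i in range(200)]
noise = [((i * 37) % 11 - 5) / 5.0 for i in range(200)]
noisy = add_noise(speech, noise, snr_db=-5.0)
```

The same gain would also be applied to the average noise power spectrum, so that the noise model and the band SNR estimates stay consistent with the corrupted test utterance.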
The results show that all MC methods can achieve good performance for the three additive noises at low SNR. For comparison, we define the average performance gain $G_{\mathrm{ave}}$ of an MC method as the difference, in absolute percentage, between the recognition rate of the MC method combined with MC-SNSC and that of its original counterpart, averaged over the four noises. For the −5 dB case, the $G_{\mathrm{ave}}$ values of MC-SNSC plus the log-add PMC, MC-SNSC plus the log-normal PMC, and MC-SNSC plus VTS-1 are 11%, 10.5%, and 5%, respectively. For the 0 dB case, the $G_{\mathrm{ave}}$ values of the three methods are 9.5%, 7%, and 4.3%, respectively. The experimental results also show that the MC-SNSC scheme enhances the performance of the original method under the four noises for all SNR cases. It is worth noting that at low SNRs such as 0 and −5 dB, MC-SNSC even gives better performance than the matched case based on MFCC features.
These experimental results reveal that the new MC-SNSC scheme can deal with different types of additive noise and yield remarkable recognition performance, which is attributed to the noise-resistant feature extraction (SNSC scheme) [1] and the pertinent model compensation. Table 3 lists the numbers of multiplication, division, logarithm, and exponential operations required by each technique to update the static parameters of a single mixture density, where N and M are the dimensions of the features in the cepstral domain and the log-spectral domain, respectively. It can be seen that the computational complexity of MC-SNSC combined with a conventional MC method is comparable to that of the conventional MC method alone, while MC-SNSC is more effective than the conventional model compensation methods.

CONCLUSION
A novel model compensation approach for robust SNSC-MFCC features is presented in this paper. A compressed mismatch function is defined for the static observations with nonuniform spectral compression. The model-based compensation method for compressed features has been derived, which employs a Gauss-Hermite integral and a conventional MC approach. The experimental results demonstrate that the MC-SNSC approach can cope with different kinds of noise automatically and substantially enhances recognition accuracy, especially at low SNR, in comparison with the conventional MC approaches. In addition, the complexity of an MC approach combined with the MC-SNSC method is comparable with that of the corresponding MC approach alone.