Enhancing the magnitude spectrum of speech features for robust speech recognition
EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 189 (2012)
Abstract
In this article, we present an effective compensation scheme to improve noise robustness for the spectra of speech signals. In this compensation scheme, called magnitude spectrum enhancement (MSE), a voice activity detection (VAD) process is performed on the frame sequence of the utterance. The magnitude spectra of nonspeech frames are then reduced while those of speech frames are amplified. In experiments conducted on the Aurora2 noisy digits database, MSE achieves an error reduction rate of nearly 42% relative to baseline processing. This method outperforms well-known spectral-domain speech enhancement techniques, including spectral subtraction (SS) and Wiener filtering (WF). In addition, the proposed MSE can be integrated with cepstral-domain robustness methods, such as mean and variance normalization (MVN) and histogram equalization (HEQ), to achieve further improvements in recognition accuracy under noise-corrupted environments.
Introduction
The environmental mismatch caused by additive noise and/or channel distortion often seriously degrades the performance of speech recognition systems. Various robustness techniques have been proposed to reduce this mismatch; they can be roughly divided into two classes: model-based and feature-based approaches. In model-based approaches, compensation is performed on the pre-trained recognition model parameters so that the modified recognition models can more effectively classify the mismatched test speech features collected in the application environment. Typical examples of this class include noise masking [1–3], speech and noise decomposition (SND) [4], vector Taylor series (VTS) [5], maximum likelihood linear regression (MLLR) [6], model-based stochastic matching [7, 8], model compensation based on nonuniform spectral compression (MCSNSC) [9], statistical re-estimation (STAR) [10], and parallel model combination (PMC) [11–13] methods. In the feature-based approaches, a noise-robust feature representation is developed to reduce the sensitivity to various acoustic conditions and thereby alleviate the mismatch between the features used for training and testing.
Examples of this class include spectral subtraction (SS) [14–17], Wiener filtering [18, 19], short-time spectral amplitude estimation based on minimum mean-squared error criteria (MMSE-STSA) [20], MMSE-based log-spectral amplitude estimation (MMSE log-STSA) [21], codeword-dependent cepstral normalization (CDCN) [22], SNR-dependent nonuniform spectral compression scheme (SNSC) [23], feature-based stochastic matching [7, 8], multivariate Gaussian-based cepstral normalization (RATZ) [10], stereo-based piecewise linear compensation for environments (SPLICE) [24, 25] methods, and a series of cepstral-feature statistics normalization techniques such as cepstral mean subtraction (CMS) [26], cepstral mean and variance normalization (MVN) [27], MVN plus ARMA filtering (MVA) [28], cepstral gain normalization (CGN) [29], histogram equalization (HEQ) [30, 31], and cepstral shape normalization (CSN) [32]. A common advantage of the feature-based methods is their relative simplicity of implementation. This simplicity arises because all of these methods focus on front-end speech feature processing without any need to change the back-end model training and recognition schemes. Despite their simplicity, these methods usually improve recognition performance significantly in noise-corrupted application environments.
The mel-frequency cepstral coefficient (MFCC) is one of the most widely used speech feature representations due to its high recognition performance under clean conditions. However, MFCC is not very noise-robust, and thus many of the robustness techniques mentioned above can be applied in various domains of a speech signal when deriving MFCC. For example, the SS, WF, MMSE-STSA, and MMSE log-STSA techniques are used in the spectral domain, whereas CMS, MVN, MVA, and HEQ are often used in the cepstral domain. In particular, the method presented in this article is designed to compensate the spectrum of the speech signal to obtain more noise-robust MFCC.
In addition to MFCC features, the energy-related feature, i.e., the logarithmic energy (log E), is also effective in discriminating different phonemes. For this reason, it is often appended to the MFCC features to further enhance recognition performance. However, similar to MFCC, the log E feature is vulnerable to noise. Many recent studies [33–35] have found that compensating the log E feature can improve the recognition accuracy significantly under noisy conditions. For example, in our previously proposed method, silence feature normalization (SFN) [35], high-pass-filtered log E is used as the indicator for speech/nonspeech frame classification, and the log E features of nonspeech frames are set to be small, while those of speech frames are kept nearly unchanged. We have shown that SFN is very effective despite its simplicity in implementation.
Partially inspired by the concept of SFN, in our previous work [36] we presented another approach, called magnitude spectrum enhancement (MSE), to further process the magnitude spectra of speech frames. Initial experiments shown in [36] indicated that MSE produced good results on the Aurora2 evaluation task [37]. The main purpose of this article is to provide a rigorous investigation of the background of MSE, as well as a series of experiments to further show the effectiveness of MSE in reducing the effect of noise for speech recognition. In MSE, the noise-corrupted signal is processed in the linear spectral domain, with the hope that the resulting speech features are more noise-robust. Briefly speaking, in MSE, the magnitude spectrum of each nonspeech frame is set to be small (as in SFN), whereas the magnitude spectrum of each speech frame is amplified by multiplying by a weighting factor that is related to the signal-to-noise ratio (SNR). The main purpose of MSE is to highlight the spectral difference between the speech and nonspeech frames, not to reconstruct the clean speech spectrum as SS and WF do. The experiments conducted on the Aurora2 digit database show that our proposed MSE can provide a significant improvement in recognition accuracy in various noise-corrupted environments. MSE performs better than many spectral-domain methods, and it can be well integrated with cepstral-domain processing techniques, such as MVN, MVA, and HEQ. The best possible average accuracy rate for the Aurora2 clean-condition training task with the proposed method can be as high as 83.80%.
The remainder of this article is organized as follows. Section ‘Effect of additive noise to the linear and logarithmic magnitude spectrum of a speech signal’ provides a mathematical analysis of a noise-corrupted speech signal as background knowledge for the presented MSE. Next, detailed MSE procedures are described in Section ‘The magnitude spectrum enhancement (MSE) approach’. Section ‘Experimental results and discussions’ contains the experimental setup and a series of experimental results together with the corresponding discussions. Finally, the concluding remarks are given in Section ‘Conclusions’.
Effect of additive noise to the linear and logarithmic magnitude spectrum of a speech signal
In this section, we provide a mathematical analysis of the effects of additive noise on the linear and logarithmic magnitude spectrum of a speech signal. Observing these effects will help us develop and present the new noise-robustness approach in Section ‘The magnitude spectrum enhancement (MSE) approach’.
Effect of additive noise on the magnitude spectra of speech/nonspeech frames
Assume that the signal for an arbitrary frame of a noise-corrupted utterance can be represented by
$$x_{m}[n]=s_{m}[n]+d_{m}[n],\quad 0\le m\le M-1,$$(1)
where m is the frame index, M is the total number of frames, and s_{ m }[n] and d_{ m }[n] are the speech and noise components of x_{ m }[n], respectively. Taking the discrete Fourier transform (DFT) on both sides of Equation (1), we have
$$X_{m}[k]=S_{m}[k]+D_{m}[k],$$(2)
where X_{ m }[k], S_{ m }[k], and D_{ m }[k] represent the spectra of x_{ m }[n], s_{ m }[n], and d_{ m }[n], respectively, for the kth frequency bin. Obviously, the speech component S_{ m }[k] approaches zero in Equation (2) for a nonspeech frame. Here, a parameter called the magnitude spectral ratio (MSR) is defined as
and represents the expectation of the ratio of a speech frame (frame p) to a nonspeech frame (frame q) in the magnitude spectrum for the kth frequency bin. It can be shown that, under an additive white Gaussian noise (AWGN) environment and assuming that S_{ p }[k] is a constant, |S_{ p }[k] + D_{ p }[k]| and |D_{ q }[k]| in Equation (3) are two random variables with Rician and Rayleigh distributions [38], respectively. The parameter MSR in Equation (3) is then
where σ^{2} is the variance of the real and imaginary parts of the noise D_{ m }[k], m=p, q, and I_{0}(·) and I_{1}(·) are the modified Bessel functions of the first kind with orders zero and one, respectively. Furthermore, γ[k] in Equation (4) is, in fact, monotonically decreasing with respect to the noise variance σ^{2} (see Appendix 1 for a detailed analysis of the above results), indicating that speech frames become increasingly indistinguishable from nonspeech frames based on their magnitude spectra as the signal-to-noise ratio (SNR) decreases.
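The trend stated above can be checked numerically. The following is a minimal sketch, not the analysis of Appendix 1: it uses the ratio of mean magnitudes as a simple stand-in for the MSR of Equation (3), holds the speech component at a constant value, and draws complex Gaussian noise whose real and imaginary parts have standard deviation `sigma`. The function name and its parameters are hypothetical.

```python
import random


def empirical_magnitude_ratio(speech_mag, sigma, n_trials=20000, seed=0):
    """Mean magnitude of a speech-plus-noise bin divided by that of a noise-only bin.

    This is an illustrative stand-in for the MSR, not the paper's closed form.
    """
    rng = random.Random(seed)
    speech_sum = 0.0
    noise_sum = 0.0
    for _ in range(n_trials):
        # Speech frame p: constant speech component plus noise (Rician magnitude).
        d_p = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        speech_sum += abs(speech_mag + d_p)
        # Nonspeech frame q: noise only (Rayleigh magnitude).
        d_q = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        noise_sum += abs(d_q)
    return speech_sum / noise_sum
```

Under these assumptions, evaluating the function for a speech magnitude of 1 with `sigma` set to 0.1, 1, and 10 gives a ratio that falls toward 1 as the noise grows, in line with the monotonic decrease described above.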
Effect of additive noise on the logarithmic magnitude spectrum in the frame sequences
First, we investigate the effect of noise on the logarithmic magnitude spectrum in an arbitrary frame within an utterance. According to Equation (2), we have
where $X_{m}^{(l)}[k]$, $S_{m}^{(l)}[k]$, and $D_{m}^{(l)}[k]$ are the logarithmic magnitude spectra of x_{ m }[n], s_{ m }[n], and d_{ m }[n], respectively, from Equation (1). Thus, the difference between $X_{m}^{(l)}[k]$ (for the noise-corrupted speech) and $S_{m}^{(l)}[k]$ (for the embedded clean speech) is
From Equation (6), it is obvious that under the same noise magnitude level |D_{ m }[k]|, the difference Δ[k] decreases as the speech magnitude |S_{ m }[k]| increases. Therefore, for a noise-corrupted utterance, the logarithmic magnitude spectrum of the speech frame is often less vulnerable to noise than that of the nonspeech (noise-only) frame. However, this condition does not hold for the (linear) magnitude spectrum.
Next, let us consider the effect of noise on the frame sequence of logarithmic magnitude spectra, denoted by $\{X_{m}^{(l)}[k]\}_{m=0}^{M-1}$, for the utterance. Taking the Taylor series approximation of Equation (5) about $(S_{m}^{(l)}[k], D_{m}^{(l)}[k])=(0,0)$ up to order 2, we have
Thus the modulation spectrum M_{ X }(jω) of the sequence $\{X_{m}^{(l)}[k]\}_{m=0}^{M-1}$, computed by
$$M_{X}(j\omega)=\sum_{m=0}^{M-1}X_{m}^{(l)}[k]\,e^{-j\omega m},$$(8)
can be approximated as
where M_{ X }(jω), M_{ S }(jω), and M_{ D }(jω) are discrete-time Fourier transforms (DTFTs) of $\{X_{m}^{(l)}[k]\}_{m=0}^{M-1}$, $\{S_{m}^{(l)}[k]\}_{m=0}^{M-1}$, and $\{D_{m}^{(l)}[k]\}_{m=0}^{M-1}$ (along the frame axis with the index m, as in Equation (8)), respectively, and the symbol “∗” denotes the convolution operation. If the two sequences, $\{S_{m}^{(l)}[k]\}_{m=0}^{M-1}$ and $\{D_{m}^{(l)}[k]\}_{m=0}^{M-1}$, are both low-pass and their bandwidths are B_{ s } and B_{ d }, respectively, then the terms M_{ D }(jω) ∗ M_{ D }(jω) and M_{ S }(jω) ∗ M_{ D }(jω) in Equation (9) have bandwidths of 2B_{ d } and B_{ s } + B_{ d }, respectively. This finding implies that $\{X_{m}^{(l)}[k]\}_{m=0}^{M-1}$ has a wider bandwidth than $\{D_{m}^{(l)}[k]\}_{m=0}^{M-1}$. In other words, the logarithmic magnitude spectrum of the noise-corrupted speech segment possesses higher modulation frequency components than that of the noise-only segment in a noisy utterance. Again, this condition does not hold for the (linear) magnitude spectrum.
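The origin of the convolution terms can be illustrated with a short worked expansion. As an assumption made here for illustration only (the exact form of Equation (5) may differ), suppose the noisy logarithmic magnitude is modeled as $\log(e^{a}+e^{b})$ with the shorthand $a=S_{m}^{(l)}[k]$ and $b=D_{m}^{(l)}[k]$. At $(a,b)=(0,0)$ the first partial derivatives are both 1/2 and the second partial derivatives are 1/4, −1/4, and 1/4, so the second-order expansion is

$$\log\left(e^{a}+e^{b}\right)\approx \log 2+\frac{a+b}{2}+\frac{(a-b)^{2}}{8}=\log 2+\frac{a+b}{2}+\frac{a^{2}}{8}-\frac{ab}{4}+\frac{b^{2}}{8}.$$

Under this assumed model, the products $a^{2}$, $ab$, and $b^{2}$ taken along the frame axis turn into convolutions of the corresponding modulation spectra, which is how terms such as M_{ S }(jω) ∗ M_{ D }(jω) and M_{ D }(jω) ∗ M_{ D }(jω) arise.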
Note: it is easy to demonstrate that the above analysis of the logarithmic magnitude spectrum can be performed on the logarithmic energy (log E) sequence in an utterance, obtaining the same conclusions [35]. That is,
1. The logarithmic energy is less distorted in a speech frame than in a nonspeech frame.
2. For the logarithmic energy sequence of a noisy utterance, the speech segment possesses higher-frequency components than the nonspeech segment.
The magnitude spectrum enhancement (MSE) approach
In this section, we describe a compensation scheme termed magnitude spectrum enhancement (MSE) [36] in order to improve the noise robustness of speech features. Briefly speaking, the magnitude spectra of the speech frames are enlarged in MSE whereas those of the nonspeech frames are normalized to be very small. In addition, the speech/nonspeech frame classification in this scheme is based on the logarithmic magnitude spectra and the logarithmic energy feature of the frames. Details of the MSE procedure are stated below.
Following the notation introduced in Section ‘Effect of additive noise to the linear and logarithmic magnitude spectrum of a speech signal’, here {x_{ m }[n], 0≤n≤N−1} is the time-domain signal for the mth frame of an utterance and N is the frame length. The spectrum for this frame is calculated as
$$X_{m}[k]=\sum_{n=0}^{N-1}x_{m}[n]\,e^{-j\frac{2\pi nk}{K}},\quad 0\le k\le K-1,\; 0\le m\le M-1,$$(10)
where K is the DFT size, and M is the total number of frames in this utterance. Thus, |X_{ m }[k]| represents the magnitude spectrum for the kth frequency bin of the mth frame. In addition, the logarithmic energy (log E) feature of the mth frame is given by
The proposed magnitude spectrum enhancement (MSE) approach uses the following two steps to create the new magnitude spectrum.
Step I: Perform voice activity detection (VAD):
The VAD process that discriminates speech frames from nonspeech frames in an utterance is based on two sources: the logarithmic magnitude spectrum (abbreviated as logMS) in Equation (10) and log E in Equation (11). Based on the observations made in Section ‘Effect of additive noise on the logarithmic magnitude spectrum in the frame sequences’, noise-corrupted speech segments possess a greater number of high (modulation) frequency components in the logMS and log E sequences than noise-only segments, and thus we expect that the high-pass-filtered logMS and log E sequences help to obtain more accurate VAD results.
As for the first source, we process the logMS sequence $\{\log(|X_{m}[k]|)\}_{m=0}^{M-1}$ with a high-pass IIR filter with an input-output relationship of
where 0≤λ<1 (the case λ=1 leads to an unstable filter). The frequency response (magnitude part) of the high-pass IIR filter is depicted in Figure 1, showing that this filter emphasizes the higher frequency portions while not eliminating the near-DC components completely.
Next, we sum up the highpass filtered logarithmic spectrum, Y_{ m }[k], over the entire frequency band for each frame:
Thus, z_{ m } in Equation (13) is viewed as the cumulative high-pass-filtered logarithmic spectral magnitude of the mth frame. Finally, the first speech/nonspeech decision parameter d_{m,1} is obtained as follows:
where the threshold θ_{ z } is simply set to the mean of the stream {z_{ m }, 0≤m≤M−1}.
As for the second source (the log E sequence) for the VAD process, we obtain the second speech/nonspeech decision parameter d_{m,2} for the mth frame,
where $e_{m}^{(h)}$ is the high-pass-filtered version of e_{ m } in Equation (11), in which the high-pass IIR filter is the same as that used in Equation (12). Again, the threshold θ_{ e } is set to the mean of the stream $\{e_{m}^{(h)},\ 0\le m\le M-1\}$.
Finally, the result of the VAD process is obtained from the two parameters d_{m,1} in Equation (14) and d_{m,2} in Equation (15):
where d_{ m } is the VAD indicator finally used. That is, the mth frame is classified as speech if either d_{m,1} or d_{m,2} is equal to unity. The main reason for using the “or” operation in Equation (16) is that speech frames are likely to be misclassified as nonspeech frames (i.e., a higher false-rejection rate) when we depend on either decision parameter d_{m,1} or d_{m,2} alone, especially when the SNR degrades.
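Step I can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the filter recursion y[m] = x[m] − λ·y[m−1] is an assumed form chosen because it matches the stated properties (stable only for λ<1, higher frequencies emphasized, nonzero gain near DC), the comparison with the mean threshold is taken as strict, and the function names `highpass` and `vad` are hypothetical. The default λ=0.7 is the value selected later in the experiments.

```python
from statistics import mean


def highpass(seq, lam):
    """Assumed first-order IIR recursion y[m] = x[m] - lam * y[m-1], with 0 <= lam < 1."""
    out = []
    prev = 0.0
    for x in seq:
        prev = x - lam * prev
        out.append(prev)
    return out


def vad(log_mag, log_energy, lam=0.7):
    """Return one 0/1 speech flag per frame.

    log_mag[m][k] is the log-magnitude spectrum of frame m at bin k;
    log_energy[m] is the log E value of frame m.
    """
    n_frames = len(log_mag)
    n_bins = len(log_mag[0])
    # Filter every frequency bin along the frame axis, then sum over bins.
    filtered = [highpass([frame[k] for frame in log_mag], lam) for k in range(n_bins)]
    z = [sum(track[m] for track in filtered) for m in range(n_frames)]
    e_h = highpass(log_energy, lam)
    # Each threshold is the mean of its own stream over the utterance.
    theta_z, theta_e = mean(z), mean(e_h)
    d1 = [1 if v > theta_z else 0 for v in z]
    d2 = [1 if v > theta_e else 0 for v in e_h]
    # "Or" fusion: a frame is speech if either detector fires.
    return [a | b for a, b in zip(d1, d2)]
```

Because the two detectors are fused with "or", a frame missed by one stream can still be recovered by the other, which is the motivation given above for reducing false rejections.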
Step II. Obtain the enhanced magnitude spectrum
This step amplifies the magnitude spectrum for the speech frames while diminishing it for the nonspeech frames. The main purpose of this step is to enlarge the ratio of speech frames to nonspeech frames in magnitude spectra to reduce the noise effect, as discussed in Section ‘Effect of additive noise to the linear and logarithmic magnitude spectrum of a speech signal’. The magnitude spectra for the nonspeech frames detected in Step I are first collected and then averaged to obtain the estimated noise (magnitude) spectrum for the utterance:
Note that here, N[k] is independent of the frame index m. Thus, the noise spectrum is estimated once for the utterance.
Next, a weighting factor for each magnitude spectral value X_{ m }[k] is defined as follows:
where α is a parameter within the range [0,1] that determines the degree of amplification, δ is a small positive constant that avoids the weighting factor becoming infinitely large as N[k]→0, and ε is a very small positive random variable such that the magnitude spectra of detected nonspeech frames are significantly reduced.
Thus, the weighting factor for a speech frame (d_{ m }=1) in Equation (18) is related to the SNR as follows:
where $\mathrm{SNR}_{m}[k]=\left(\frac{|X_{m}[k]|^{2}}{N^{2}[k]}\right)-1$ is the (estimated) SNR for the kth frequency bin of the mth frame.
Finally, the enhanced magnitude spectrum is obtained by multiplying the original magnitude spectrum by the weighting factor w_{ m }[k] in Equation (18):
$$|\tilde{X}_{m}[k]|=w_{m}[k]\,|X_{m}[k]|.$$(20)
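Step II can be sketched as follows. The exact expression of the weighting factor is not reproduced here, so the speech-frame weight (|X|/(N+δ))^α used below is an assumption chosen to agree with the description (it grows with the SNR, is controlled by α, and stays finite as N[k]→0). The function name `enhance` and its arguments are hypothetical; the defaults for α, δ, and the range of ε are the values reported in the experimental section.

```python
import random


def enhance(mag, flags, alpha=0.5, delta=0.001, eps_max=1e-5, seed=0):
    """Scale magnitude spectra frame by frame.

    mag[m][k] is the magnitude spectrum; flags[m] is 1 for speech, 0 for nonspeech.
    Assumes at least one frame is flagged as nonspeech.
    """
    rng = random.Random(seed)
    n_bins = len(mag[0])
    noise_frames = [frame for frame, d in zip(mag, flags) if d == 0]
    # Utterance-level noise estimate: average magnitude over nonspeech frames.
    noise = [sum(f[k] for f in noise_frames) / len(noise_frames) for k in range(n_bins)]
    enhanced = []
    for frame, d in zip(mag, flags):
        if d == 1:
            # Assumed speech weight: (|X| / (N + delta)) ** alpha.
            weights = [(x / (noise[k] + delta)) ** alpha for k, x in enumerate(frame)]
        else:
            # Small positive random weight in (0, eps_max], never exactly zero.
            weights = [eps_max * (1.0 - rng.random()) for _ in frame]
        enhanced.append([w * x for w, x in zip(weights, frame)])
    return enhanced
```

With this assumed weight, a speech-frame bin is amplified whenever its magnitude exceeds N[k]+δ, while every bin of a detected nonspeech frame is pushed close to zero.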
The proposed MSE has the following properties:
1. In MSE, the embedded VAD process uses the logarithmic magnitude spectrum rather than the linear magnitude spectrum. According to the discussions in Section ‘Effect of additive noise to the linear and logarithmic magnitude spectrum of a speech signal’, the logarithmic magnitude spectrum is less vulnerable to noise in speech frames, and its temporal-domain sequence exhibits a wider (modulation) spectral bandwidth in speech portions than in nonspeech portions. Based on these two characteristics, the logarithmic magnitude spectrum is a more appropriate VAD indicator than the linear magnitude spectrum. The experimental results shown later will reveal that using the logarithmic magnitude spectrum gives MSE better recognition accuracy than using the linear magnitude spectrum.
2. By assigning different weights to the magnitude spectra of speech and nonspeech frames, the speech portions of an utterance are highlighted and the difference between the speech and nonspeech portions in magnitude spectrum is strongly emphasized. This effect leads to a large magnitude spectral ratio (MSR) as defined in Equation (3) and implies that the effect of noise has been effectively reduced.
3. The idea of MSE is partially motivated by the matched filter theory in the field of communications [38]. For an observed signal denoted by x[n]=s[n] + d[n], where s[n] and d[n] are the desired signal and additive noise, respectively, the magnitude (frequency) response of the matched filter which maximizes the output SNR is [39]:
$$|H(j\omega)|=\frac{|S(j\omega)|}{P_{d}(\omega)},$$(21)
where |S(jω)| and P_{ d }(ω) are the magnitude spectrum of s[n] and the power spectral density of the noise d[n], respectively. From Equation (21), we find that |H(jω)| increases monotonically with the input frequency-domain SNR (defined by $\frac{|S(j\omega)|^{2}}{P_{d}(\omega)}$) provided the signal level |S(jω)|^{2} or the noise level P_{ d }(ω) is fixed. Thus, MSE shares the idea of the matched filter and uses a spectral weighting factor w_{ m }[k] in Equation (18) which is positively correlated with the SNR. However, MSE differs from the matched filter in some aspects: First, MSE applies the magnitude spectrum of the noisy signal x[n] rather than that of the clean signal s[n], which is not available and requires estimation. Second, the magnitude spectrum of the noise d[n] is used, which approximates the square root of the power spectral density of the noise. Finally, MSE additionally detects the nonspeech regions and makes the corresponding spectra nearly zero, which is a nonlinear operation and can further distinguish the speech and nonspeech frames.
4. As in the SFN method [35], the magnitude spectrum in MSE for the features in nonspeech portions is set to be small. However, in speech portions of the utterance, MSE further amplifies the magnitude spectrum, whereas in SFN the energy-related feature is kept nearly unchanged.
5. Like the spectral compensation techniques spectral subtraction (SS) [14–16] and Wiener filtering (WF) [18, 19], MSE attempts to reduce the effect of noise in the spectral domain of speech signals. However, the main purpose of SS and WF is to restore a clean spectrum from the noise-corrupted utterance. This situation contrasts with MSE, where the (magnitude) spectrum of the speech portions is amplified, possibly making the resulting spectrum quite different from the clean spectrum. In general, the updated magnitude spectra using SS and WF are often presented as follows:
$$\text{SS:}\quad |\tilde{X}_{m}[k]|\approx |X_{m}[k]|\left(1+\frac{1}{\mathrm{SNR}_{m}[k]}\right)^{-\frac{1}{2}},$$(22)
For MSE, the new magnitude spectrum is:
In addition, the speech and nonspeech portions are treated quite differently in MSE (as shown in Equations (18) and (20)), while they are not explicitly treated differently in SS and WF.
6. In MSE, the VAD procedure used in Step I is quite simple to implement and can be replaced with any other VAD method. In addition, the cepstral features derived from the MSE-processed spectrum can be further compensated using any cepstral-domain robustness techniques such as MVN, MVA, and HEQ to achieve further improvements in recognition performance, which will be shown in Section ‘Experimental results and discussions’.
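The contrast drawn in property 5 can be made concrete by comparing gains as functions of the estimated SNR. This is a hedged sketch: `ss_gain` follows Equation (22), `wiener_gain` is the textbook Wiener gain SNR/(1+SNR) rather than a form quoted from this article, and `mse_gain` assumes a speech-frame weight of (|X|/N)^α with δ neglected, which together with SNR = |X|²/N² − 1 gives (1+SNR)^{α/2}. All three function names are hypothetical.

```python
def ss_gain(snr):
    """Spectral-subtraction gain implied by Equation (22)."""
    return (1.0 + 1.0 / snr) ** -0.5


def wiener_gain(snr):
    """Textbook Wiener gain SNR / (1 + SNR)."""
    return snr / (1.0 + snr)


def mse_gain(snr, alpha=0.5):
    """Assumed MSE speech-frame gain (1 + SNR) ** (alpha / 2)."""
    return (1.0 + snr) ** (alpha / 2.0)
```

For any positive SNR the first two gains lie below one (attenuation toward an estimate of the clean spectrum), whereas the assumed MSE gain lies above one, which reflects the stated aim of amplifying speech frames rather than restoring the clean spectrum.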
Experimental results and discussions
We use two sets of experimental environments in this article. In the first environment, the Aurora2 connected US-digit database [37] is the platform for evaluating the proposed MSE and various other techniques. It is used to explore the resulting spectrograms of the speech signals processed by MSE and some other spectral-domain processes, to analyze the possible improvements achievable by each approach, and to compare the different techniques. In the second environment, the NUM100A continuous Mandarin speech database [40] is used. This database contains microphone-recorded Mandarin digit strings produced by Mandarin adults. We apply the proposed MSE to this data set to investigate whether MSE remains effective in processing noisy speech in a different language.
Experiments for the Aurora2 database
Here, the presented MSE scheme has been tested with the AURORA Project Database Version 2.0 (Aurora2), the details of which are described in [37]. In short, the testing data consist of 4004 utterances from 52 female and 52 male speakers, and three different subsets are defined for the recognition experiments: Test Sets A and B are each affected by four types of noise, and Set C is affected by two types. Each noise instance is added to the clean speech signal at seven SNR levels (ranging from 20 to −5 dB). The signals in Test Sets A and B are filtered with a G.712 filter, and those in Set C are filtered with an MIRS filter. G.712 and MIRS are two standard frequency characteristics defined by the ITU [41].
The Aurora2 task has the following two training modes [37]:
1. In the first mode, “clean-condition training”, the training data consist of 8440 clean speech utterances from 55 female and 55 male adults.
2. In the second mode, “multi-condition training”, the clean training data in the first mode are equally split into 20 subsets. Four different types of noise are added to these 20 subsets at five different SNR conditions. The four noise types are suburban train, babble, car and exhibition hall, which are the same as the noise types in Test Set A. The SNR conditions are 20, 15, 10, and 5 dB and the clean condition.
Therefore, in the first mode, “clean-condition training”, the obtained clean acoustic models contain no information about the possible distortions. This mode can help us evaluate how robust the speech features (associated with the robustness algorithm) are against noise. As for the second mode, “multi-condition training”, the corresponding results can reveal the impact of a different type of noise or a different SNR than seen during training [37]. In the following experiments and discussions, we will primarily focus on the first mode in order to observe how well the presented MSE reduces the effect of noise. However, we will also provide the experimental results for the second mode together with relatively brief discussions.
Results for the task of clean-condition training and multi-condition testing
With the Aurora2 database under the mode of “clean-condition training”, we perform the MSE method and a series of robustness methods to compare the recognition accuracy. As for the cepstral-domain methods, each utterance in the clean training set and three testing sets is directly converted to a 13-dimensional MFCC (c_{1}–c_{12}, c_{0}) sequence according to the feature settings in [37]. Next, the MFCC features are processed using MVN, MVA or HEQ. The spectral-domain methods used here include our MSE, spectral subtraction (SS), Wiener filtering (WF) and MMSE-based log-spectral amplitude estimation (MMSE log-STSA). Each utterance is first processed in the linear spectral domain. The updated spectra are converted to a sequence of 13-dimensional MFCC (c_{1}–c_{12}, c_{0}). The resulting 13 new features, plus their first- and second-order derivatives, are the components of the final 39-dimensional feature vector. With the new feature vectors in the clean training set, the hidden Markov models (HMMs) for each digit and silence are trained with the demo scripts provided by the Aurora2 CD set [42]. Each digit HMM has 16 states, with 3 Gaussian mixtures per state.
Detailed information about some of the methods used follows:
1. We apply three versions of spectral subtraction (SS) proposed in [14–16]. For clarity, they are denoted by SS_{Boll}, SS_{Berouti}, and SS_{Kamath}, respectively, in which the author names are represented by the subscripts.
2. As with spectral subtraction, three versions of the Wiener filtering (WF) methods proposed in [18, 19] are tested here. The first method is based on a priori signal-to-noise ratio (PSNR) estimation, and the latter two WF methods apply a two-step noise reduction (TSNR) procedure and a harmonic regeneration noise reduction (HRNR) scheme, respectively. Thus, these methods are abbreviated as WF_{PSNR}, WF_{TSNR}, and WF_{HRNR} for later discussions.
3. For the proposed MSE, the parameter δ in Equation (18) is set to 0.001, and the positive random number ε in Equation (18) is uniformly distributed within the range (0, 10^{−5}). In order to obtain a proper selection of the filter coefficient λ in Equation (12) and the weight parameter α in Equation (18), we use the 8440 noise-corrupted training utterances for the mode of “multi-condition training” in the Aurora2 database as the development set. The averaged recognition accuracy rates with respect to different assignments of λ and α (both from 0.1 to 0.9 with an interval of 0.2) are shown in Table 1. As a result, we set λ and α to 0.7 and 0.5, respectively, since this setting gives the best accuracy rate for the development set.
4. For MVA, the order of the ARMA filter is set to 3.
5. For HEQ, each feature stream in the utterance is normalized to approach a Gaussian distribution with zero mean and unity variance.
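Of the cepstral-domain methods listed above, MVN is the simplest to illustrate. The sketch below shows the usual per-utterance form, in which each feature stream is shifted to zero mean and scaled to unit variance; treating the statistics as per-utterance, and the function name `mvn`, are assumptions made here rather than details taken from the experimental setup.

```python
from statistics import mean, pstdev


def mvn(features):
    """Normalize each feature dimension to zero mean and unit variance over the utterance.

    features[m][i] is coefficient i of frame m.
    """
    n_dims = len(features[0])
    columns = [[frame[i] for frame in features] for i in range(n_dims)]
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        # A constant stream has zero deviation; map it to zeros instead of dividing by zero.
        [(frame[i] - mu) / sd if sd > 0 else 0.0 for i, (mu, sd) in enumerate(stats)]
        for frame in features
    ]
```

In an integrated system of the kind described later, such a function would be applied to the MFCC streams derived from the MSE-processed spectra.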
Comparison of various noise robustness approaches
Table 2 presents the individual set recognition accuracy rates averaged over five SNR conditions (0–20 dB at 5 dB intervals) for Test Sets A, B, and C, achieved using various approaches. Figure 2 shows the accuracy rates for spectral-domain methods under different SNR conditions, which are obtained by averaging over all ten noise types contained in the three Test Sets. Based on Table 2 and Figure 2, we make the following observations:
1. Compared to baseline processing, most approaches provide significant recognition accuracy improvement in almost all cases. All three SS methods give better results than the baseline for Test Sets A and B, while the improvement for Test Set C is relatively insignificant. A possible explanation of this finding is that SS is particularly designed to alleviate additive noise and thus does not handle the channel mismatch in the utterances of Test Set C very well. On the other hand, WF_{PSNR} performs the best among the three Wiener filtering approaches, while WF_{TSNR} and WF_{HRNR} result in poorer accuracy rates relative to the MFCC baseline. Furthermore, WF_{PSNR} behaves better than SS and is also very helpful with Test Set C. Finally, the method “MMSE log-STSA” performs quite well, and its corresponding averaged recognition accuracy is slightly better than that of WF_{PSNR}.
2. Among the spectral-domain methods studied, the proposed MSE method outperforms MMSE log-STSA and various versions of SS and WF in almost all cases. Furthermore, MSE leads to a relative error reduction rate of 49.82% for additive-noise conditions (Test Sets A and B) and 42.72% for all conditions (Test Sets A, B and C) compared with baseline results. The results show that MSE effectively enhances the robustness of MFCC in various noise-corrupted environments.
3. The proposed MSE method provides very promising recognition accuracy rates for all SNR conditions. In particular, MSE outperforms WF_{PSNR} and MMSE log-STSA for higher SNR cases (20 and 15 dB), and the three methods deliver very similar accuracy rates for lower SNR cases.
4. Among the three cepstral-domain methods, HEQ behaves the best, followed by MVA and then MVN. In addition, the three cepstral-domain methods perform better than most spectral-domain methods, with the exception that MVN performs worse than MSE for Test Sets A and B. This finding leads to the concept of integrating these cepstral-domain methods with the proposed MSE as discussed below. It will be shown that such integration can offer further improvements in performance.
5. In order to examine whether the presented MSE gives rise to a statistically significant improvement in recognition accuracy relative to the other methods, the one-proportion z-test [43] is performed as follows: Let p and p_{0} denote the accuracy rates provided by MSE and the method for comparison, respectively. We set the null hypothesis as H_{0}: p=p_{0} and the alternative hypothesis H_{1}: p>p_{0}, and the test statistic for the hypothesis is:
$$z=\frac{p-p_{0}}{\sqrt{p_{0}(1-p_{0})/N}},$$(25)
where N is the number of words in the test and here N=214465 for the Aurora2 evaluation task [37]. If the test statistic z in Equation (25) is larger than about 2.326, then the null hypothesis H_{0} is rejected and the improvement is statistically significant with a confidence level of 99% (since $\int_{2.326}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{u^{2}}{2}}\,\mathrm{d}u\approx 1\%=1-99\%$). According to the obtained test statistic z in Equation (25), we find that the improvement brought by MSE relative to the other spectral-domain methods is statistically significant. For example, when the method for comparison is MMSE log-STSA, the corresponding test statistic z in Equation (25) is 41.99, far larger than the threshold 2.326.
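The test in Equation (25) is straightforward to compute. The sketch below is illustrative only: the function names are hypothetical, and any accuracy values passed to it in an example are placeholders rather than figures taken from Table 2. The upper-tail probability uses the complementary error function from the Python standard library.

```python
import math


def one_proportion_z(p, p0, n):
    """Test statistic of Equation (25) for accuracies p (tested) and p0 (reference)."""
    return (p - p0) / math.sqrt(p0 * (1.0 - p0) / n)


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

For instance, with placeholder accuracies of 0.80 and 0.79 and N=214465, `one_proportion_z` returns a value of roughly 11.4, well above the 2.326 threshold, and `upper_tail(2.326)` is approximately 0.01, matching the 99% confidence level quoted above.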
In addition to the recognition accuracy, we also examine how well the various spectral-domain methods reduce the spectrogram mismatch caused by additive noise. Figures 3, 4, 5, 6, 7, 8, 9, and 10 show the spectrograms of a digit utterance (“FLJ_97159A.08” in the Aurora2 database) for two SNR levels, clean and 5 dB (with babble noise), obtained by SS_{Boll}, SS_{Berouti}, SS_{Kamath}, WF_{PSNR}, WF_{TSNR}, WF_{HRNR}, MMSE log-STSA, and the proposed MSE, respectively. First, the figures show that for the clean case, the voiced portions and the short pauses between any two consecutive digits or syllables are clearly revealed by almost all approaches. Second, for the noise-corrupted case, WF_{PSNR}, MMSE log-STSA, and MSE highlight the short pauses more than the other approaches, and they preserve the voiced segments with less distortion (especially in the region [0.7 s, 1.3 s]). Thus, the similar treatment of these short pauses under clean and noise-corrupted conditions by the three methods may result in a relatively small mismatch between the two SNR conditions, which would account for the higher recognition accuracy shown previously. Finally, the detected speech segments are clearly separated in the MSE-processed spectrogram, which may be one reason why MSE performs very well.
Integration of MSE with cepstral feature processing techniques
MSE, which operates in the spectral domain of the features, can be easily integrated with cepstral-domain processing techniques. Here, we test whether such integration further improves recognition performance. MFCC features are first derived from the MSE-processed spectra and then processed using MVN, HEQ, or MVA. For a more complete comparison, we also integrate each of the spectral-domain methods SS_{Berouti}, WF_{PSNR}, and MMSE log-STSA with the cepstral-domain methods. The corresponding recognition results are shown in Table 3. For comparison purposes, the accuracy rates for MSE, SS_{Berouti}, WF_{PSNR}, MMSE log-STSA, MVN, HEQ, and MVA are relisted from Table 2. Several findings follow from Table 3:

1.
The combination of MSE and a cepstral-domain method produces better results than the individual component methods in most cases. For example, MSE plus MVA (82.37%) is better than MSE (76.94%) and MVA (78.75%) in recognition accuracy averaged over the ten noise types of the three Test Sets, and it results in a relative error reduction rate of 56.20% with respect to the MFCC baseline. Similar results are achieved with MSE plus MVN and MSE plus HEQ. These results clearly indicate that MSE can be successfully added to cepstral-domain approaches to further improve noise robustness.

2.
For the channel-distorted signals in Test Set C, MSE performs worse than the cepstral-domain methods alone. However, combining MSE with either MVN or MVA yields better recognition rates for Test Set C. For example, MSE plus MVA (80.58%) is better than MVA alone (79.12%) in averaged recognition accuracy. Therefore, MSE enhances MVN and MVA in processing channel-distorted signals even though it is primarily designed for additive-noise conditions.

3.
Unlike MSE, each of the other three spectral-domain methods (SS_{Berouti}, WF_{PSNR}, and MMSE log-STSA), when combined with any cepstral-domain method, performs worse than the component cepstral-domain method alone. For example, MMSE log-STSA plus HEQ achieves an averaged accuracy of 79.58%, lower than the 82.21% obtained by HEQ alone. These results again imply that the presented MSE outperforms the other three spectral-domain methods used here.
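The relative error reduction rates quoted in the findings above can be reproduced with a short calculation. The sketch below assumes that the rate is measured against the MFCC baseline accuracy of 59.75% (the value quoted later in the discussion of the exponent α); the function name is hypothetical.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative error reduction (%) between two accuracy rates given in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

BASELINE = 59.75  # MFCC baseline accuracy (%) quoted in the text

print(round(relative_error_reduction(BASELINE, 82.37), 2))  # MSE plus MVA
print(round(relative_error_reduction(BASELINE, 76.94), 2))  # MSE alone
```

For MSE plus MVA this gives about 56.20%, and for MSE alone about 42.7%, in line with the 42.72% reported for all conditions (the small difference is presumably due to rounding of the quoted accuracy rates).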
The influence of the VAD error for MSE in speech recognition
In this section, we first investigate the effect of the VAD error on the recognition performance of MSE. For this purpose, we perform MSE under the “oracle condition”. That is, the VAD results for each clean utterance are directly applied to its various noise-corrupted counterparts to implement the magnitude spectrum enhancement. This process is referred to as “MSE^{(o)}” here. Assuming that the VAD error of MSE for a clean utterance is small and negligible, the recognition accuracy difference between MSE^{(o)} and MSE for noise-corrupted utterances can be viewed as a consequence of the VAD error due to noise.
The recognition accuracy rates for MSE^{(o)} and MSE are listed in Table 4. As expected, MSE^{(o)} always performs better than MSE because it contains no VAD errors. However, the difference in accuracy is not large. In the worst case (SNR = 0 dB), the performance degradation is 4.96% (1.64% for Set A, 8.77% for Set B, and 4.00% for Set C), and on average, it is 2.90% (1.57% for Set A, 3.92% for Set B, and 3.49% for Set C). These results indicate that the performance of MSE is somewhat influenced by the error of the embedded VAD process.
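The overall degradation figures above are consistent with a weighted average of the per-set values. The following sketch assumes the usual Aurora2 layout of four noise types in each of Test Sets A and B and two in Test Set C, so that the sets are weighted 4:4:2; this weighting is an assumption, as it is not stated explicitly here.

```python
# Number of noise types per test set (assumed 4/4/2 split, ten in total).
NOISE_TYPES = {"A": 4, "B": 4, "C": 2}

def overall_average(per_set):
    """Average per-set values, weighting each set by its number of noise types."""
    total = sum(NOISE_TYPES.values())
    return sum(per_set[name] * NOISE_TYPES[name] for name in NOISE_TYPES) / total

worst_case = {"A": 1.64, "B": 8.77, "C": 4.00}  # degradation (%) at SNR = 0 dB
on_average = {"A": 1.57, "B": 3.92, "C": 3.49}  # degradation (%) averaged over SNRs

print(f"{overall_average(worst_case):.3f}")
print(f"{overall_average(on_average):.3f}")
```

Under this assumption the worst-case value comes out at about 4.96% and the average at about 2.89%, which agrees with the reported 2.90% up to rounding of the per-set values.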
Next, we select different VAD indicators for MSE to see the corresponding effect. According to the analysis in Section ‘Effect of additive noise on the logarithmic magnitude spectrum in the frame sequences’, the highpass filtered logarithmic magnitude spectrum (log MS) and the logarithmic energy (log E) can emphasize the difference between the speech and nonspeech frames, and thus they are chosen as the VAD indicators of MSE. For comparison, we adopt the following two alternatives as the VAD indicators:

1.
the original linear magnitude spectrum ($X_m[k]$ in Equation (10)) and the energy (the exponential of $e_m$ in Equation (11)),

2.
the highpass filtered linear magnitude spectrum and the highpass filtered energy,
and the corresponding two MSE processes are denoted by ${\text{MSE}}^{\left({L}_{1}\right)}$ and ${\text{MSE}}^{\left({L}_{2}\right)}$, respectively, for simplicity. Figure 11 shows the recognition accuracy rates for ${\text{MSE}}^{\left({L}_{1}\right)}$ and ${\text{MSE}}^{\left({L}_{2}\right)}$ under different SNR conditions for the three Test Sets, and we add the results of the original MSE to this figure for comparison. From this figure, we find that when the SNR is high (clean and 20 dB), there is no substantial performance difference among the three MSE methods. However, as the noise level increases, the original MSE significantly outperforms the other two versions, ${\text{MSE}}^{\left({L}_{1}\right)}$ and ${\text{MSE}}^{\left({L}_{2}\right)}$. As a result, compared with the linear magnitude spectrum and energy, the highpass filtered logarithmic magnitude spectrum (as well as the logarithmic energy) can provide more accurate VAD under noisy conditions and achieve better recognition results for the subsequent MSE processing.
Further issues regarding MSE processing
Several issues relating to the newly proposed MSE scheme are further investigated in this section.
The effect of the exponent α in MSE
One of the central ideas of MSE is to amplify the spectral magnitude for speech frames, and from Equations (18) and (24), the amplification factor (for a speech frame) is given by Equation (26).
As Equation (26) shows, the exponent α controls the degree of amplification. Increasing the value of α enlarges the difference between the speech and nonspeech frames in the magnitude spectrum, but it may also lead to a greater mismatch among the speech frames for the same syllable or phoneme under different SNR conditions. As a result, a larger α in MSE does not always bring about improved recognition accuracy, even if the VAD contains no errors. Here, we assign the exponent α different values within the range [0, 1] and then proceed with MSE to investigate the corresponding recognition accuracy.
Figure 12 shows the recognition results averaged over five SNR conditions (0 ∼ 20 dB) and all ten noise types in the three Test Sets for different values of α for MSE (the filter coefficient λ in Equation (12) is fixed as 0.7). As shown in Figure 12, we find that

1.
The case α=0, where the magnitude spectrum of the speech frames is kept unchanged in MSE, yields an averaged recognition accuracy of 72.26%, significantly better than the MFCC baseline result (59.75%). This result shows that simply setting the magnitude spectrum of the detected nonspeech frames to nearly zero is beneficial to the recognition performance.

2.
The recognition accuracy improves as the value α is increased from 0 to 0.6, and the additional improvement in accuracy is 4.80% (from 72.26% to 77.06%). Therefore, amplifying the magnitude spectrum of the speech frames correctly is helpful.

3.
When the exponent α is further increased from 0.6 to 1, the recognition rates worsen, possibly due to the enlarged mismatch among the speech frames mentioned previously. However, the largest drop in accuracy is just 0.62% (from 77.06% at α=0.6 to 76.44% at α=0.8), implying that the recognition accuracy is relatively insensitive to α (provided that α is within the range [0.6, 1]).
The effect of the filter coefficient λ in MSE
As stated in Section ‘The magnitude spectrum enhancement (MSE) approach’, the filter coefficient λ in Equation (12) determines the frequency response of the highpass filter for the VAD process of MSE. The case λ=0 corresponds to using the logarithmic magnitude spectrum (log MS) and the logarithmic energy (log E) directly as the VAD features. On the other hand, increasing λ further reduces the lower modulation frequency components and further emphasizes the higher ones in the log MS and log E streams, as shown in Figure 1. This parameter was preliminarily set to 0.7 in the previous experiments. Now, we vary its value from 0 to 0.9 in steps of 0.1 to perform the corresponding MSE. (Note that setting λ=1 will result in an unstable filter.)
Figure 13 shows the recognition results averaged over five SNR conditions (0 ∼ 20 dB) and all ten noise types in the three Test Sets using different values of λ for MSE (the exponent α in Equation (26) is fixed as 0.5). We first find that applying MSE with a positive λ achieves better results than applying MSE with λ=0 in most cases, indicating that emphasizing the higher modulation frequency components enhances the VAD of MSE. Next, setting λ to 0.8 yields the best accuracy rate (77.04%), 0.10% better than the accuracy obtained by setting λ=0.7 (76.94%). Finally, when the value of λ is within the range [0.1, 0.9], the differences among the accuracy rates obtained with different values of λ are relatively small, and the largest drop from the best accuracy is just 1.82%. This result implies that nearly optimal performance can be obtained without meticulous adjustment of the parameter λ.
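The role of λ described in this subsection can be illustrated numerically. Equation (12) is not reproduced in this section, so the sketch below assumes a first-order recursive filter y[m] = x[m] − λ·y[m−1], that is, H(z) = 1/(1 + λz^{-1}). This form is only an assumption, chosen because it matches the three properties stated above: it is the identity for λ=0, it increasingly attenuates low and emphasizes high modulation frequencies as λ grows, and it is no longer stable at λ=1. The function names are hypothetical.

```python
import cmath
from math import pi

def highpass(x, lam):
    """Filter the sequence x with the assumed recursion y[m] = x[m] - lam * y[m-1]."""
    y, prev = [], 0.0
    for sample in x:
        prev = sample - lam * prev
        y.append(prev)
    return y

def gain(lam, omega):
    """Magnitude response |H(e^{j omega})| of the assumed H(z) = 1 / (1 + lam z^-1)."""
    return 1.0 / abs(1.0 + lam * cmath.exp(-1j * omega))

# Gain at the lowest (omega = 0) and highest (omega = pi) modulation frequency.
for lam in (0.0, 0.4, 0.7, 0.9):
    print(f"lambda={lam:.1f}  gain at 0: {gain(lam, 0.0):.3f}  gain at pi: {gain(lam, pi):.3f}")
```

Under this assumption the gain is 1/(1+λ) at zero frequency and 1/(1−λ) at the highest frequency, so the high-frequency gain grows without bound as λ approaches 1, which is consistent with the remark that λ=1 gives an unstable filter.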
The effect of processing the short pauses within the utterance in MSE
In the VAD procedure of MSE, each frame in an utterance is always classified as either speech or nonspeech. Therefore, no frame is classified as a “transient frame”, as happens in some more elaborate VAD processes. In fact, the transient frames in the short regions between two connected acoustic units (often called “short pauses”) are quite often classified as nonspeech in MSE, and thus their magnitude spectra are set to very small values. In this respect, the VAD in MSE is unlike some conventional endpoint detectors, which decide only the onset and offset frames of an utterance and leave the inter-word or inter-syllable frames, which often possess lower energy, unprocessed. We find that further processing of these detected short pauses between the onset and offset times of an utterance is quite helpful in speech recognition, especially when the SNR is low. To demonstrate this phenomenon, a simpler form of MSE is designed, in which we only process the first and the last detected nonspeech segments (the corresponding frames are assigned very small magnitude spectra) and treat the remaining nonspeech segments as speech (the magnitude spectra of the corresponding frames are weighted as in Equation (24)). This method is called “MSE^{(s)}” here for simplicity, and we compare it with the original MSE with respect to speech recognition performance.
Figure 14 shows the recognition accuracy rates for MSE^{(s)} under different SNR conditions for the three Test Sets. In this figure, we see that almost no performance difference exists between MSE^{(s)} and MSE for the clean condition. However, when noise is present, MSE^{(s)} always performs worse than MSE, and the performance difference becomes more significant as the SNR decreases. On average, the recognition accuracy of MSE^{(s)} is around 4% lower than that of MSE. In general, in acoustic model training, a short pause model is trained to aid in word or syllable boundary determination and thus to improve the recognition accuracy. However, under noise-corrupted conditions, the short pause model becomes less helpful, as shown by the MSE^{(s)} results. Furthermore, the MSE results show that additionally classifying the transient frames within the voice-activated region of an utterance as nonspeech (so that the corresponding magnitude spectrum becomes very small) significantly improves the recognition accuracy in noise-corrupted environments.
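The difference between MSE and MSE^{(s)} lies only in which detected nonspeech segments are kept. A minimal sketch of the relabelling step is given below, assuming that the VAD output is available as a list of per-frame labels (0 for nonspeech, 1 for speech); the function name and the label encoding are hypothetical.

```python
def edge_only_nonspeech(labels):
    """Keep only the first and last nonspeech runs; relabel interior ones as speech.

    labels: list of 0 (nonspeech) / 1 (speech), one entry per frame.
    """
    # Collect the nonspeech runs as half-open index ranges (begin, end).
    runs, start = [], None
    for i, lab in enumerate(labels):
        if lab == 0 and start is None:
            start = i
        elif lab == 1 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))

    # Start from "all speech" and restore only the first and last nonspeech runs.
    out = [1] * len(labels)
    keep = runs[:1] + runs[-1:] if runs else []
    for begin, end in keep:
        for i in range(begin, end):
            out[i] = 0
    return out

vad = [0, 0, 1, 1, 0, 1, 0, 0]   # one short pause at frame 4
print(edge_only_nonspeech(vad))  # the short pause is treated as speech
```

Frames that remain labelled 0 would then receive very small magnitude spectra, while all other frames, including the relabelled short pauses, would be weighted as speech.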
Results for the task of multi-condition training and multi-condition testing
We apply the MSE method, SS_{Berouti} (which performs the best among the three SS methods in Table 2), WF_{PSNR} (which performs the best among the three WF methods in Table 2), and the three aforementioned cepstral-domain methods to the Aurora2 database under the mode of “multi-condition training”. As stated earlier, here the training data have five SNR conditions (clean, 20, 15, 10, and 5 dB) and the same four types of noise as in Test Set A. In addition to the individual methods, we also investigate the pairing of a spectral-domain method with a cepstral-domain method to see if further accuracy improvement can be achieved. Table 5 presents the individual set recognition accuracy rates averaged over five SNR conditions for Test Sets A, B, and C, achieved by the various methods. We have the following findings from Table 5:

1.
For the spectral-domain methods, SS_{Berouti} and WF_{PSNR} degrade the accuracy of the MFCC. The proposed MSE provides the MFCC with around 1% accuracy improvement for Test Sets A and B (additive-noise environments), but it still worsens the recognition accuracy for Test Set C (with both additive noise and channel distortion). A possible explanation is that these spectral-domain methods introduce distortions and attenuate the discriminative components in the speech features when they alleviate the noise effect in the multi-condition training data.

2.
In contrast with the spectral-domain methods, the three cepstral-domain methods give significant performance improvement over the MFCC baseline. MVA performs best, followed by MVN and then HEQ. We find that MVN outperforms HEQ slightly, which is not the case for the mode of clean-condition training as shown in Table 2. This is probably because the mismatch between the training data and the testing data is relatively small in the mode of multi-condition training, so that the over-normalization problem may occur in HEQ, resulting in worse accuracy relative to MVN.

3.
None of the three spectral-domain methods, SS_{Berouti}, WF_{PSNR}, and MSE, helps the subsequent cepstral-domain method to provide better recognition accuracy rates than the cepstral-domain method alone. These results again imply that these spectral-domain methods very probably diminish the helpful speech components in the noisy training data and are inappropriate for the task of multi-condition training.
Experiments for the NUM100A database
Besides the Aurora2 database, we adopt another database, called NUM100A [40], to test the performance of the presented MSE. The NUM100A database consists of 8000 Mandarin digit strings produced by 50 male and 50 female speakers, recorded in a normal laboratory environment at an 8 kHz sampling rate. These 8000 digit strings include 1000 each of two-, three-, four-, five-, six-, and seven-digit strings, plus 2000 single-digit utterances. Among the 8000 Mandarin digit strings, 7520 with different lengths are selected for training, while the other 480 are used for testing. In particular, four types of noise (white, babble, pink, and f16) taken from the NOISEX-92 database [44] are added to the 480 clean testing strings at four different SNRs (20, 15, 10, and 5 dB) to produce the noise-corrupted testing data. The speech features used here are the same as those in the Aurora2 task: 13 MFCCs (c1–c12, c0) and their delta and delta-delta coefficients. With the feature vectors in the training set, the HMMs for each of the 10 digits and silence were trained with the HTK toolkit [45]. Each digit HMM contains five states and eight mixtures per state, and the silence HMM has three states and eight mixtures per state.
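Noise-corrupted test data of this kind are typically produced by scaling a noise segment so that the mixture reaches a target SNR. The exact procedure used for NUM100A is not described here, so the following is only a sketch under a simple assumption: the SNR is defined as the ratio of the average powers of the speech and the scaled noise over the whole utterance, and both signals are given as equal-length lists of samples. The function names are hypothetical.

```python
from math import log10, sqrt

def mean_power(x):
    """Average power of a sample sequence."""
    return sum(v * v for v in x) / len(x)

def add_noise(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add it."""
    scale = sqrt(mean_power(speech) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]

# Toy signals, used only to check that the target SNR is reached.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = add_noise(speech, noise, 10.0)
residual = [y - s for y, s in zip(noisy, speech)]
print(10.0 * log10(mean_power(speech) / mean_power(residual)))  # measured SNR in dB
```

Other conventions, such as measuring the speech level only over active speech, would give different scale factors for the same nominal SNR.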
For simplicity, we use MSE with the same parameter settings as in the Aurora2 task to process the training and testing signals and to create the corresponding MFCC features. In addition, since we only intend to investigate whether MSE also helps noisy speech recognition on a database other than Aurora2, we do not evaluate the other spectral-domain methods such as SS and WF, and we simply choose one cepstral-domain method, MVN, for processing the MFCC features.
Figures 15 and 16a–d show the recognition accuracy rates for the four methods, MFCC baseline, MSE, MVN, and the pairing of MSE and MVN, under the clean condition and the four noise-corrupted conditions with different SNRs. From these figures, we have the following findings:

1.
Under the clean and matched condition, both MSE and MVN degrade the recognition rate of the MFCC slightly, and the combination of MSE and MVN yields the worst results. These results imply that the robustness methods can reduce the discriminability of the original features when the environment is noise-free.

2.
The recognition accuracy of the original MFCC becomes clearly worse in mismatched noisy conditions. However, the presented MSE enhances the MFCC and brings about significant accuracy improvement irrespective of the type of noise. For example, at the SNR of 10 dB, MSE provides the MFCC with accuracy rate improvements of 20.09%, 53.42%, 26.77%, and 41.74% for white, babble, pink, and f16 noise, respectively. Therefore, MSE works well as a noise-robustness approach for this Mandarin digit database in addition to Aurora2.

3.
MVN improves the recognition accuracy considerably relative to the MFCC baseline when the environment is noisy, and it outperforms MSE in most cases. However, the cascade of MSE and MVN performs better than MVN alone (except for the babble and f16 noises at the SNR of 20 dB), showing again that MSE combines well with the cepstral-domain method MVN.
Conclusions
In this article, we investigate the effect of additive noise on the linear and logarithmic spectra of noise-corrupted utterances and provide a compensation scheme, called magnitude spectrum enhancement (MSE), to enhance the noise robustness of speech features. MSE aims to shrink the magnitude spectra in the silence portions of an utterance and to strengthen them in the speech portions. Experimental results show that MSE is very effective in improving recognition performance under various noise conditions for the Aurora2 clean-condition training task, and that it outperforms spectral subtraction and Wiener filtering. Furthermore, MSE can be successfully combined with cepstral-domain methods to deliver even better recognition rates.
Appendix 1
Given that $A=|A|e^{j\phi}$ is a complex-valued constant, and $N=N_R+jN_I$ is a complex-valued random variable whose real and imaginary parts, $N_R$ and $N_I$, are independent Gaussian distributed with zero mean and a common variance σ^{2}, then it can be shown that [38]:

1.
The random variable $|A+N|$ is Rician distributed, and its probability density function (pdf) is
$$f_{|A+N|}(x)=\frac{x}{\sigma^{2}}\exp\left(-\frac{x^{2}+|A|^{2}}{2\sigma^{2}}\right)I_{0}\left(\frac{|A|}{\sigma^{2}}x\right)u(x),$$(27)
where $I_0(\cdot)$ is the modified Bessel function of the first kind with order zero, and $u(\cdot)$ is the unit-step function.

2.
The random variable $|N|$ is Rayleigh distributed, and its probability density function (pdf) is
$$f_{|N|}(x)=\frac{x}{\sigma^{2}}\exp\left(-\frac{x^{2}}{2\sigma^{2}}\right)u(x).$$(28)
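As a quick numerical sanity check of the Rayleigh density in Equation (28), the sketch below integrates it with a simple midpoint rule and compares the resulting mean with the standard closed form σ√(π/2). The helper names, the value of σ, and the integration range are arbitrary choices made only for illustration.

```python
from math import exp, pi, sqrt

def rayleigh_pdf(x, sigma):
    """Equation (28): pdf of |N|, zero for negative x because of the unit step."""
    return (x / sigma**2) * exp(-x * x / (2.0 * sigma**2)) if x >= 0 else 0.0

def integrate(f, a, b, steps=20000):
    """Midpoint-rule numerical integration of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

sigma = 1.5
upper = 12.0 * sigma  # the tail beyond this point is negligible
area = integrate(lambda x: rayleigh_pdf(x, sigma), 0.0, upper)
mean = integrate(lambda x: x * rayleigh_pdf(x, sigma), 0.0, upper)
print(area, mean, sigma * sqrt(pi / 2.0))
```

The area should be very close to 1, and the numerically computed mean should agree with σ√(π/2).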
Therefore, the terms $|S_p[k]+D_p[k]|$ and $|D_q[k]|$ in Equation (3) are Rician and Rayleigh distributed, respectively. Furthermore, assuming $D_p[k]$ and $D_q[k]$ are statistically independent (since they correspond to different frames) and identically distributed, we have Equation (3) as
where ${}_1F_1(\cdot;\cdot;\cdot)$ is the confluent hypergeometric function [38]:
in which $I_1(\cdot)$ is the modified Bessel function of the first kind with order one.
It can be shown that
and since $\frac{d}{dx}I_{0}(x)=I_{1}(x)$ and $\frac{d}{dx}\left(xI_{1}(x)\right)=xI_{0}(x)$, we have
Therefore, ${}_1F_1(-\frac{1}{2};1;-x)$ is a positive and monotonically increasing function for x>0, and we conclude that the parameter $\gamma[k]=\frac{\pi}{2}\,{}_1F_1\left(-\frac{1}{2};1;-\frac{|S_p[k]|^{2}}{2\sigma^{2}}\right)$ in Equation (29) decreases as the noise variance σ^{2} increases (that is, as $\frac{1}{\sigma^{2}}$ decreases), with the two limiting cases $\lim_{\sigma^{2}\to 0}\gamma[k]=\infty$ and $\lim_{\sigma^{2}\to\infty}\gamma[k]=\frac{\pi}{2}$.
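The monotonic behaviour claimed above can be checked numerically. The sketch below assumes that the function in question is ${}_1F_1(-\frac{1}{2};1;-x)$ and uses its standard closed form $e^{-x/2}\left[(1+x)I_0(x/2)+xI_1(x/2)\right]$, which is the expression that appears in the mean of a Rician variable. The Bessel functions are evaluated by their power series, and a direct series of ${}_1F_1$ serves as a cross-check; all function names are hypothetical.

```python
from math import exp, factorial

def bessel_i(order, x, terms=60):
    """Modified Bessel function of the first kind, I_order(x), by its power series."""
    return sum((x / 2.0) ** (2 * k + order) / (factorial(k) * factorial(k + order))
               for k in range(terms))

def kummer_half(x):
    """1F1(-1/2; 1; -x) via exp(-x/2) * [(1 + x) I0(x/2) + x I1(x/2)]."""
    return exp(-x / 2.0) * ((1.0 + x) * bessel_i(0, x / 2.0) + x * bessel_i(1, x / 2.0))

def kummer_series(a, b, z, terms=200):
    """Direct power series of 1F1(a; b; z), used only as a cross-check."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (a + k) / (b + k) * z / (k + 1)
    return total

values = [kummer_half(0.5 * i) for i in range(21)]  # x = 0, 0.5, ..., 10
print(values[0], values[-1])
print(all(later > earlier for earlier, later in zip(values, values[1:])))
```

The function equals 1 at x=0, which corresponds to the limiting value π/2 of γ[k] as the noise variance grows, and it increases with x over the sampled range.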
References
 1.
Holmes JN, Sedgwick NC: Noise compensation for speech recognition using probabilistic models. 1986 International Conference on Acoustics, Speech and Signal Processing (ICASSP’86), vol. 11 (Tokyo, Japan, 1986), pp. 741–744
 2.
Klatt DH: A digital filter bank for spectral matching. 1976 International Conference on Acoustics, Speech and Signal Processing (ICASSP’76), vol. 1 (Philadelphia, USA, 1976), pp. 573–576
 3.
Nadas A, Nahamoo D, Picheny M: Speech recognition using noise-adaptive prototypes. 1988 International Conference on Acoustics, Speech and Signal Processing (ICASSP’88), vol. 1 (New York, USA, 1988), pp. 517–520
 4.
Varga AP, Moore RK: Hidden Markov model decomposition of speech and noise. 1990 International Conference on Acoustics, Speech and Signal Processing (ICASSP’90), vol. 2 (Albuquerque, USA, 1990), pp. 845–848
 5.
Acero A, Deng L, Kristjansson T, Zhang J: HMM adaptation using vector Taylor series for noisy speech recognition. 2000 International Conference on Spoken Language Processing (ICSLP’00), vol. 3 (Beijing, China, 2000), pp. 869–872
 6.
Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density HMMs. Comput. Speech Lang 1995, 9: 171–186. 10.1006/csla.1995.0010
 7.
Sankar A, Lee CH: A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Trans. Speech Audio Process 1996, 4: 190–202. 10.1109/89.496215
 8.
Lee CH: On stochastic feature and model compensation approaches to robust speech recognition. Speech Commun 1998, 25: 29–47. 10.1016/S0167-6393(98)00028-4
 9.
Ning GX, Wei G, Chu KK: Model compensation approach based on nonuniform spectral compression features for noisy speech recognition. EURASIP J. Adv. Signal Process 2007., 2007:
 10.
Moreno PJ, Raj B, Stern RM: Data-driven environmental compensation for speech recognition: a unified approach. Speech Commun 1998, 24: 267–285. 10.1016/S0167-6393(98)00025-9
 11.
Gales MJF, Young SJ: Cepstral parameter compensation for HMM recognition in noise. Speech Commun 1993, 12: 231–239. 10.1016/0167-6393(93)90093-Z
 12.
Gales MJF, Young SJ: Robust speech recognition in additive and convolutional noise using parallel model combination. Comput. Speech Lang 1995, 9: 289–307. 10.1006/csla.1995.0014
 13.
Gales MJF, Young SJ: A fast and flexible implementation of parallel model combination. 1995 International Conference on Acoustics, Speech and Signal Processing (ICASSP’95), vol. 1 (Detroit, USA, 1995), pp. 133–136
 14.
Boll SF: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process 1979, 27: 113–120. 10.1109/TASSP.1979.1163209
 15.
Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. 1979 International Conference on Acoustics, Speech and Signal Processing (ICASSP’79), vol. 4 (Washington, USA, 1979), pp. 208–211
 16.
Kamath S, Loizou P: A multiband spectral subtraction method for enhancing speech corrupted by colored noise. 2002 International Conference on Acoustics, Speech and Signal Processing (ICASSP’02), vol. 4 (Orlando, USA, 2002), pp. IV–4164
 17.
BabaAli B, Sameti H, Safayani M: Likelihood maximizing based multiband spectral subtraction for robust speech recognition. EURASIP J. Adv. Signal Process 2009., 2009:
 18.
Scalart P, Filho JV: Speech enhancement based on a priori signal to noise estimation. 1996 International Conference on Acoustics, Speech and Signal Processing (ICASSP’96), vol. 2 (Atlanta, USA, 1996), pp. 629–632
 19.
Plapous C, Marro C, Scalart P: Improved signal-to-noise ratio estimation for speech enhancement. IEEE Trans. Audio Speech Lang. Process 2006, 14: 2098–2108.
 20.
Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process 1984, 32: 1109–1121. 10.1109/TASSP.1984.1164453
 21.
Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process 1985, 33: 443–445. 10.1109/TASSP.1985.1164550
 22.
Acero A: Acoustical and environmental robustness in automatic speech recognition. Ph.D. dissertation, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA (1990)
 23.
Chu KK, Leung SH: SNR-dependent nonuniform spectral compression for noisy speech recognition. 2004 International Conference on Acoustics, Speech and Signal Processing (ICASSP’04), vol. 1 (Montreal, Canada, 2004), pp. 973–976
 24.
Deng L, Acero A, Jiang L, Droppo J, Huang X: High-performance robust speech recognition using stereo training data. 2001 International Conference on Acoustics, Speech and Signal Processing (ICASSP’01), vol. 1 (Salt Lake City, USA, 2001), pp. 301–304
 25.
Droppo J, Deng L, Acero A: Evaluation of the SPLICE algorithm on the Aurora2 database. 2001 Eurospeech Conference on Speech Communications and Technology (Eurospeech’01), vol. 1 (Aalborg, Denmark, 2001), pp. 185–188
 26.
Atal BS: Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am 1974, 55: 1304–1312. 10.1121/1.1914702
 27.
Tibrewala S, Hermansky H: Multiband and adaptation approaches to robust speech recognition. 1997 Eurospeech Conference on Speech Communications and Technology (Eurospeech’97), vol. 1 (Rhodes, Greece, 1997), pp. 2619–2622
 28.
Chen CP, Bilmes JA: MVA processing of speech features. IEEE Trans. Audio Speech Lang. Process 2007, 15: 257–270.
 29.
Yoshizawa S, Hayasaka N, Wada N, Miyanaga Y: Cepstral gain normalization for noise robust speech recognition. 2004 International Conference on Acoustics, Speech and Signal Processing (ICASSP’04), vol. 1 (Montreal, Canada, 2004), pp. 209–212
 30.
Hilger F, Ney H: Quantile based histogram equalization for noise robust large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process 2006, 14: 845–854.
 31.
Suh Y, Kim H: Histogram equalization to model adaptation for robust speech recognition. EURASIP J. Adv. Signal Process 2010., 2010:
 32.
Du J, Wang RH: Cepstral shape normalization (CSN) for robust speech recognition. 2008 International Conference on Acoustics, Speech and Signal Processing (ICASSP’08), vol. 1 (Las Vegas, USA, 2008), pp. 4389–4392
 33.
Zhu W, O’Shaughnessy D: Log-energy dynamic range normalization for robust speech recognition. 2005 International Conference on Acoustics, Speech and Signal Processing (ICASSP’05), vol. 1 (Philadelphia, USA, 2005), pp. 245–248
 34.
Hwang TH, Chang SC: Energy contour enhancement for noisy speech recognition. 2004 International Symposium on Chinese Spoken Language Processing (ISCSLP’04), vol. 1 (Hong Kong, China, 2004), pp. 249–252
 35.
Wang CC, Pan CA, Hung JW: Silence feature normalization for robust speech recognition in additive noise environments. 2008 International Conference on Spoken Language Processing (Interspeech 2008-ICSLP), vol. 1 (Brisbane, Australia, 2008), pp. 1028–1031
 36.
Tu WH, Hung JW: Magnitude spectrum enhancement for robust speech recognition. 2010 International Conference on Acoustics, Speech and Signal Processing (ICASSP’10), vol. 1 (Dallas, USA, 2010), pp. 4586–4589
 37.
Hirsch HG, Pearce D: The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions. ISCA ITRW ASR2000 “Automatic Speech Recognition: Challenges for the Next Millennium”, vol. 1 (Paris, France, 2000), pp. 181–188
 38.
Haykin S: Communication Systems. (John Wiley & Sons, Inc., New York, 2000)
 39.
Turin GL: An introduction to matched filters. IRE Trans. Inf. Theory 1960, 6: 311–329. 10.1109/TIT.1960.1057571
 40.
Available from the Association for Computational Linguistics and Chinese Language Processing (ACLCLP). http://www.aclclp.org.tw
 41.
ITU recommendation G.712: Transmission performance characteristics of pulse code modulation channels. 1996.
 42.
Available from the Evaluations and Language resources Distribution Agency (ELDA). http://www.elda.org/article52.html
 43.
Hogg RV, Tanis EA: Probability and Statistical Inference. (Prentice Hall, Upper Saddle River, NJ, 2006)
 44.
Varga AP, Steeneken HJM, Tomlinson M, Jones D: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Tech. Rep. DRA Speech Research Unit, 1992
 45.
Available from the Hidden Markov Model Toolkit, Cambridge University Engineering Dept. (CUED). http://htk.eng.cam.ac.uk/
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Hung, J., Fan, H. & Tu, W. Enhancing the magnitude spectrum of speech features for robust speech recognition. EURASIP J. Adv. Signal Process. 2012, 189 (2012). https://doi.org/10.1186/1687-6180-2012-189
Keywords
 Voice activity detection
 Robust speech recognition
 Speech enhancement