Enhancing the magnitude spectrum of speech features for robust speech recognition

Abstract

In this article, we present an effective compensation scheme to improve the noise robustness of the spectra of speech signals. In this compensation scheme, called magnitude spectrum enhancement (MSE), a voice activity detection (VAD) process is performed on the frame sequence of the utterance. The magnitude spectra of non-speech frames are then reduced while those of speech frames are amplified. In experiments conducted on the Aurora-2 noisy digits database, MSE achieves an error reduction rate of nearly 42% relative to baseline processing. This method outperforms well-known spectral-domain speech enhancement techniques, including spectral subtraction (SS) and Wiener filtering (WF). In addition, the proposed MSE can be integrated with cepstral-domain robustness methods, such as mean and variance normalization (MVN) and histogram equalization (HEQ), to achieve further improvements in recognition accuracy in noise-corrupted environments.

Introduction

The environmental mismatch caused by additive noise and/or channel distortion often seriously degrades the performance of speech recognition systems. Various robustness techniques have been proposed to reduce this mismatch, and they can be roughly divided into two classes: model-based and feature-based approaches. In model-based approaches, compensation is performed on the pre-trained recognition model parameters so that the modified recognition models can more effectively classify the mismatched test speech features collected in the application environment. Typical examples of this class include noise masking [1–3], speech and noise decomposition (SND) [4], vector Taylor series (VTS) [5], maximum likelihood linear regression (MLLR) [6], model-based stochastic matching [7, 8], model compensation based on non-uniform spectral compression (MC-SNSC) [9], statistical re-estimation (STAR) [10], and parallel model combination (PMC) [11–13] methods. In feature-based approaches, a noise-robust feature representation is developed to reduce the sensitivity to various acoustic conditions and thereby alleviate the mismatch between the features used for training and testing. Examples of this class include spectral subtraction (SS) [14–17], Wiener filtering [18, 19], short-time spectral amplitude estimation based on minimum mean-squared error criteria (MMSE-STSA) [20], MMSE-based log-spectral amplitude estimation (MMSE log-STSA) [21], codeword-dependent cepstral normalization (CDCN) [22], the SNR-dependent non-uniform spectral compression scheme (SNSC) [23], feature-based stochastic matching [7, 8], multivariate Gaussian-based cepstral normalization (RATZ) [10], and stereo-based piecewise linear compensation for environments (SPLICE) [24, 25], as well as a series of cepstral-feature statistics normalization techniques such as cepstral mean subtraction (CMS) [26], cepstral mean and variance normalization (MVN) [27], MVN plus ARMA filtering (MVA) [28], cepstral gain normalization (CGN) [29], histogram equalization (HEQ) [30, 31], and cepstral shape normalization (CSN) [32]. A common advantage of the feature-based methods is their relative simplicity of implementation: all of these methods focus on front-end speech feature processing without any need to change the back-end model training and recognition schemes. Despite their simplicity, these methods usually improve recognition performance significantly in noise-corrupted application environments.

The mel-frequency cepstral coefficient (MFCC) is one of the most widely used speech feature representations due to its high recognition performance under clean conditions. However, MFCC is not very noise-robust, and thus many robustness techniques mentioned above can be applied in various domains of a speech signal when deriving MFCC. For example, SS, WF, MMSE-STSA, and MMSE log-STSA techniques are used in the spectral domain whereas CMS, MVN, MVA, and HEQ are often used in the cepstral domain. In particular, the method presented in this article is designed to compensate the spectrum of the speech signal to obtain more noise-robust MFCC.

In addition to the MFCC features, the energy-related feature, i.e., the logarithmic energy (log E), is also effective in discriminating between phonemes. For this reason, it is often appended to the MFCC features to further enhance recognition performance. However, similar to MFCC, the log E feature is vulnerable to noise. Many recent studies [33–35] have found that compensating the log E feature can improve recognition accuracy significantly under noisy conditions. For example, in our previously proposed method, silence feature normalization (SFN) [35], the high-pass-filtered log E is used as the indicator for speech/non-speech frame classification, and the log E features of non-speech frames are set to small values while those of speech frames are kept nearly unchanged. We have shown that SFN is very effective despite its simplicity of implementation.

Partially inspired by the concept of SFN, in our previous work [36] we presented another approach, called magnitude spectrum enhancement (MSE), to further process the magnitude spectra of speech frames. Initial experiments in [36] indicated that MSE produced good results on the Aurora-2 evaluation task [37]. The main purpose of this article is to provide a rigorous investigation of the background of MSE, together with a series of experiments that further show the effectiveness of MSE in reducing the effect of noise on speech recognition. In MSE, the noise-corrupted signal is processed in the linear spectral domain, with the hope that the resulting speech features are more noise-robust. Briefly speaking, in MSE the magnitude spectrum of each non-speech frame is set to be small (as in SFN), whereas the magnitude spectrum of each speech frame is amplified by a weighting factor related to the signal-to-noise ratio (SNR). The main purpose of MSE is to highlight the spectral difference between the speech and non-speech frames, not to reconstruct the clean speech spectrum as SS and WF do. Experiments conducted on the Aurora-2 digit database show that the proposed MSE provides a significant improvement in recognition accuracy in various noise-corrupted environments. MSE performs better than many spectral-domain methods, and it can be well integrated with cepstral-domain processing techniques such as MVN, MVA, and HEQ. The best average accuracy rate for the Aurora-2 clean-condition training task with the proposed method can be as high as 83.80%.

The remainder of this article is organized as follows. Section ‘Effect of additive noise on the linear and logarithmic magnitude spectrum of a speech signal’ provides a mathematical analysis of a noise-corrupted speech signal as background for the presented MSE. Next, the detailed MSE procedure is described in Section ‘The magnitude spectrum enhancement (MSE) approach’. Section ‘Experimental results and discussions’ contains the experimental setup and a series of experimental results together with the corresponding discussions. Finally, concluding remarks are given in Section ‘Conclusions’.

Effect of additive noise on the linear and logarithmic magnitude spectrum of a speech signal

In this section, we provide a mathematical analysis of the effects of additive noise on the linear and logarithmic magnitude spectrum of a speech signal. Observing these effects will help us develop the new noise-robustness approach presented in Section ‘The magnitude spectrum enhancement (MSE) approach’.

Effect of additive noise on the magnitude spectra of speech/non-speech frames

Assume that the signal for an arbitrary frame of a noise-corrupted utterance can be represented by

$$x_m[n] = s_m[n] + d_m[n], \qquad 0 \le m \le M-1,$$
(1)

where $m$ is the frame index, $M$ is the total number of frames, and $s_m[n]$ and $d_m[n]$ are the speech and noise components of $x_m[n]$, respectively. Taking the discrete Fourier transform (DFT) of both sides of Equation (1), we have

$$X_m[k] = S_m[k] + D_m[k],$$
(2)

where $X_m[k]$, $S_m[k]$, and $D_m[k]$ represent the spectra of $x_m[n]$, $s_m[n]$, and $d_m[n]$, respectively, at the $k$th frequency bin. Obviously, the speech component $S_m[k]$ in Equation (2) approaches zero for a non-speech frame. Here, a parameter called the magnitude spectral ratio (MSR) is defined as

$$\gamma[k] = E\left[\frac{|S_p[k] + D_p[k]|}{|D_q[k]|}\right], \qquad p \ne q,$$
(3)

which represents the expectation of the ratio of the magnitude spectrum of a speech frame (frame $p$) to that of a non-speech frame (frame $q$) at the $k$th frequency bin. It can be shown that, in an additive white Gaussian noise (AWGN) environment and assuming that $S_p[k]$ is a constant, $|S_p[k] + D_p[k]|$ and $|D_q[k]|$ in Equation (3) are two random variables with Rician and Rayleigh distributions [38], respectively. The MSR in Equation (3) is then

$$\gamma[k] = \frac{\pi}{2}\exp\!\left(-\frac{|S_p[k]|^2}{4\sigma^2}\right)\left[\left(1+\frac{|S_p[k]|^2}{2\sigma^2}\right)I_0\!\left(\frac{|S_p[k]|^2}{4\sigma^2}\right)+\frac{|S_p[k]|^2}{2\sigma^2}\,I_1\!\left(\frac{|S_p[k]|^2}{4\sigma^2}\right)\right],$$
(4)

where $\sigma^2$ is the variance of the real and imaginary parts of the noise $D_m[k]$, $m = p, q$, and $I_0(\cdot)$ and $I_1(\cdot)$ are the modified Bessel functions of the first kind with orders zero and one, respectively. Furthermore, $\gamma[k]$ in Equation (4) is monotonically decreasing with respect to the noise variance $\sigma^2$ (see Appendix 1 for a detailed analysis), indicating that speech frames become increasingly indistinguishable from non-speech frames on the basis of their magnitude spectra as the signal-to-noise ratio (SNR) decreases.
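
To make this monotonic trend concrete, the short Python sketch below (our own illustration, not part of the original article; it assumes NumPy and SciPy are available, and the function name `msr` is ours) evaluates Equation (4) for a fixed $|S_p[k]| = 1$ over increasing noise variances.

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled Bessel functions I0, I1

def msr(s_mag, sigma):
    """Magnitude spectral ratio gamma[k] of Equation (4) for a constant |S_p[k]|."""
    x = s_mag**2 / (4.0 * sigma**2)
    # i0e(x) = exp(-x) * I0(x), so the exp(-x) factor of Eq. (4) is already folded
    # in, which keeps the evaluation numerically stable for small sigma.
    return (np.pi / 2.0) * ((1.0 + 2.0 * x) * i0e(x) + 2.0 * x * i1e(x))

for sigma in (0.1, 0.2, 0.5, 1.0, 2.0, 5.0):
    print(f"sigma = {sigma:4.1f}   gamma[k] = {msr(1.0, sigma):8.3f}")
# gamma[k] decreases monotonically and approaches pi/2 ~ 1.571 as sigma grows,
# i.e., speech and non-speech frames become harder to separate at low SNR.
```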

Effect of additive noise on the logarithmic magnitude spectrum in the frame sequences

First, we investigate the effect of noise on the logarithmic magnitude spectrum in an arbitrary frame within an utterance. According to Equation (2), we have

$$X_m^{(l)}[k] = \log(|X_m[k]|) = 0.5\log(|X_m[k]|^2) = 0.5\log(|S_m[k]+D_m[k]|^2) \approx 0.5\log(|S_m[k]|^2+|D_m[k]|^2) = 0.5\log\!\left(\exp(2S_m^{(l)}[k])+\exp(2D_m^{(l)}[k])\right),$$
(5)

where $X_m^{(l)}[k]$, $S_m^{(l)}[k]$, and $D_m^{(l)}[k]$ are the logarithmic magnitude spectra of $x_m[n]$, $s_m[n]$, and $d_m[n]$ in Equation (1), respectively. Thus, the difference between $X_m^{(l)}[k]$ (for the noise-corrupted speech) and $S_m^{(l)}[k]$ (for the embedded clean speech) is

$$\Delta[k] = X_m^{(l)}[k] - S_m^{(l)}[k] \approx 0.5\log\!\left(1+\frac{\exp(2D_m^{(l)}[k])}{\exp(2S_m^{(l)}[k])}\right) = 0.5\log\!\left(1+\frac{|D_m[k]|^2}{|S_m[k]|^2}\right).$$
(6)

From Equation (6), it is obvious that for the same noise magnitude level $|D_m[k]|$, the difference $\Delta[k]$ decreases as the speech magnitude $|S_m[k]|$ increases. Therefore, for a noise-corrupted utterance, the logarithmic magnitude spectrum of a speech frame is often less vulnerable to noise than that of a non-speech (noise-only) frame. However, this property does not hold for the (linear) magnitude spectrum.
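
As a quick worked example, at a local SNR of 0 dB ($|S_m[k]| = |D_m[k]|$) Equation (6) gives $\Delta[k] \approx 0.5\log 2 \approx 0.35$, whereas at a local SNR of 20 dB ($|S_m[k]|^2 = 100\,|D_m[k]|^2$) the deviation shrinks to $0.5\log(1.01) \approx 0.005$.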

Next, let us consider the effect of noise on the frame sequence of logarithmic magnitude spectra, denoted by $\{X_m^{(l)}[k]\}_{m=0}^{M-1}$, for the utterance. Taking the Taylor series approximation of Equation (5) with respect to $(S_m^{(l)}[k], D_m^{(l)}[k]) = (0,0)$ up to order 2, we have

$$X_m^{(l)}[k] = 0.5\log\!\left(\exp(2S_m^{(l)}[k])+\exp(2D_m^{(l)}[k])\right) \approx 0.5\log 2 + 0.5\left(S_m^{(l)}[k]+D_m^{(l)}[k]\right) + 0.25\left((S_m^{(l)}[k])^2+(D_m^{(l)}[k])^2-2S_m^{(l)}[k]D_m^{(l)}[k]\right).$$
(7)

Thus the modulation spectrum $M_X(\omega)$ of the sequence $\{X_m^{(l)}[k]\}_{m=0}^{M-1}$, computed by

$$M_X(\omega) = \sum_{m=0}^{M-1} X_m^{(l)}[k]\, e^{-j\omega m},$$
(8)

can be approximated as

$$M_X(\omega) \approx (\pi\log 2)\,\delta(\omega) + 0.5\left(M_S(\omega)+M_D(\omega)\right) + \frac{1}{8\pi}\left(M_S(\omega)*M_S(\omega) + M_D(\omega)*M_D(\omega) - 2\,M_S(\omega)*M_D(\omega)\right),$$
(9)

where $M_X(\omega)$, $M_S(\omega)$, and $M_D(\omega)$ are the discrete-time Fourier transforms (DTFTs) of $\{X_m^{(l)}[k]\}_{m=0}^{M-1}$, $\{S_m^{(l)}[k]\}_{m=0}^{M-1}$, and $\{D_m^{(l)}[k]\}_{m=0}^{M-1}$ (taken along the frame axis with index $m$, as in Equation (8)), respectively, and the symbol “$*$” denotes convolution. If the two sequences $\{S_m^{(l)}[k]\}_{m=0}^{M-1}$ and $\{D_m^{(l)}[k]\}_{m=0}^{M-1}$ are both low-pass with bandwidths $B_s$ and $B_d$, respectively, then the terms $M_D(\omega)*M_D(\omega)$ and $M_S(\omega)*M_D(\omega)$ in Equation (9) have bandwidths of $2B_d$ and $B_s+B_d$, respectively. This implies that $\{X_m^{(l)}[k]\}_{m=0}^{M-1}$ has a wider bandwidth than $\{D_m^{(l)}[k]\}_{m=0}^{M-1}$. In other words, the logarithmic magnitude spectrum of a noise-corrupted speech segment possesses higher modulation frequency components than that of a noise-only segment in a noisy utterance. Again, this property does not hold for the (linear) magnitude spectrum.
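
This bandwidth-expansion effect is easy to observe numerically. In the sketch below (our own construction, not from the article), two strictly band-limited random sequences stand in for $\{S_m^{(l)}[k]\}$ and $\{D_m^{(l)}[k]\}$; passing them through Equation (5) produces modulation-frequency energy above the original band, while the noise-only sequence has none by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
M, b = 1024, 64  # M frames; inputs band-limited to the first b modulation bins

def bandlimited_noise(M, b, rng):
    """Zero-mean random sequence whose modulation spectrum is ideally low-pass."""
    spec = np.zeros(M // 2 + 1, dtype=complex)
    spec[1:b] = rng.standard_normal(b - 1) + 1j * rng.standard_normal(b - 1)
    x = np.fft.irfft(spec, n=M)
    return x / x.std()

s_l = bandlimited_noise(M, b, rng)  # plays the role of {S_m^(l)[k]}
d_l = bandlimited_noise(M, b, rng)  # plays the role of {D_m^(l)[k]}
x_l = 0.5 * np.log(np.exp(2 * s_l) + np.exp(2 * d_l))  # Equation (5)

def energy_above(seq, b):
    """Fraction of (DC-removed) modulation-spectral energy above bin b."""
    p = np.abs(np.fft.rfft(seq - seq.mean())) ** 2
    return p[b:].sum() / p.sum()

print(f"noise-only:   {energy_above(d_l, b):.2e}")  # essentially zero by construction
print(f"noisy speech: {energy_above(x_l, b):.2e}")  # clearly nonzero: wider bandwidth
```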

Note: it is easy to demonstrate that the above analysis of the logarithmic magnitude spectrum can also be applied to the logarithmic energy (log E) sequence of an utterance, yielding the same conclusions [35]. That is,

  1. The logarithmic energy is less distorted in a speech frame than in a non-speech frame.

  2. In the logarithmic energy sequence of a noisy utterance, the speech segment possesses higher (modulation) frequency components than the non-speech segment.

The magnitude spectrum enhancement (MSE) approach

In this section, we describe a compensation scheme, termed magnitude spectrum enhancement (MSE) [36], that improves the noise robustness of speech features. Briefly speaking, MSE amplifies the magnitude spectra of the speech frames while reducing those of the non-speech frames to very small values. The speech/non-speech frame classification in this scheme is based on the logarithmic magnitude spectra and the logarithmic energy features of the frames. Details of the MSE procedure are given below.

Following the notation introduced in Section ‘Effect of additive noise on the linear and logarithmic magnitude spectrum of a speech signal’, $\{x_m[n], 0 \le n \le N-1\}$ is the time-domain signal of the $m$th frame of an utterance, where $N$ is the frame length. The spectrum of this frame is calculated as

$$X_m[k] = \sum_{n=0}^{N-1} x_m[n]\, e^{-j2\pi nk/K}, \qquad 0 \le k \le \tfrac{K}{2},\; 0 \le m \le M-1,$$
(10)

where $K$ is the DFT size and $M$ is the total number of frames in the utterance. Thus, $|X_m[k]|$ is the magnitude spectrum at the $k$th frequency bin of the $m$th frame. In addition, the logarithmic energy (log E) feature of the $m$th frame is given by

$$e_m = \log\left(\sum_{n=0}^{N-1} x_m^2[n]\right), \qquad 0 \le m \le M-1.$$
(11)

The proposed magnitude spectrum enhancement (MSE) approach uses the following two steps to create the new magnitude spectrum.

Step I: Perform voice activity detection (VAD):

The VAD process that discriminates speech frames from non-speech frames in an utterance is based on two sources: the logarithmic magnitude spectrum (abbreviated as logMS) of Equation (10) and log E of Equation (11). Based on the observations made in Section ‘Effect of additive noise on the logarithmic magnitude spectrum in the frame sequences’, noise-corrupted speech segments possess more high (modulation) frequency components in the logMS and log E sequences than noise-only segments; thus, we expect the high-pass-filtered logMS and log E sequences to yield more accurate VAD results.

As for the first source, we process the logMS sequence $\{\log(|X_m[k]|)\}_{m=0}^{M-1}$ with a high-pass IIR filter having the input-output relationship

$$Y_m[k] = \log(|X_m[k]|) - \lambda\, Y_{m-1}[k], \qquad 0 \le k \le \tfrac{K}{2},\; 0 \le m \le M-1,$$
(12)

where $0 \le \lambda < 1$ (the case $\lambda = 1$ leads to an unstable filter). The frequency response (magnitude part) of this high-pass IIR filter is depicted in Figure 1, showing that the filter emphasizes the higher-frequency portions while not completely eliminating the near-DC components.

Figure 1. The frequency response (magnitude part) of the high-pass filter $H(z) = \frac{1}{1+\lambda z^{-1}}$ with different assignments of $\lambda$.
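
For reference, the response in Figure 1 can be reproduced with a few lines of Python (a sketch of ours using SciPy's `freqz`; Equation (12) corresponds to feedback coefficients $a = [1, \lambda]$):

```python
from scipy.signal import freqz

# H(z) = 1 / (1 + lambda * z^-1): numerator b = [1], denominator a = [1, lambda]
for lam in (0.3, 0.5, 0.7, 0.9):
    w, h = freqz(b=[1.0], a=[1.0, lam], worN=8)
    print(f"lambda={lam}: |H| at DC = {abs(h[0]):.2f}, near Nyquist = {abs(h[-1]):.2f}")
# |H(0)| = 1/(1+lambda) < 1 while |H(pi)| ~ 1/(1-lambda) > 1: a gentle high-pass
# that attenuates, but does not remove, the near-DC components.
```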

Next, we sum up the high-pass filtered logarithmic spectrum, Y m [k], over the entire frequency band for each frame:

$$z_m = \sum_{k=0}^{K/2} Y_m[k].$$
(13)

Thus, $z_m$ in Equation (13) can be viewed as the cumulative high-pass-filtered logarithmic spectral magnitude of the $m$th frame. The first speech/non-speech decision parameter $d_{m,1}$ is then obtained as follows:

$$d_{m,1} = \begin{cases} 1 & \text{if } z_m \ge \theta_z \\ 0 & \text{otherwise} \end{cases}, \qquad 0 \le m \le M-1,$$
(14)

where the threshold $\theta_z$ is simply set to the mean of the stream $\{z_m, 0 \le m \le M-1\}$.

As for the second source (the log E sequence), we obtain the second speech/non-speech decision parameter $d_{m,2}$ for the $m$th frame:

$$d_{m,2} = \begin{cases} 1 & \text{if } e_m^{(h)} \ge \theta_e \\ 0 & \text{otherwise} \end{cases}, \qquad 0 \le m \le M-1,$$
(15)

where $e_m^{(h)}$ is the high-pass-filtered version of $e_m$ in Equation (11), using the same high-pass IIR filter as in Equation (12). Again, the threshold $\theta_e$ is set to the mean of the stream $\{e_m^{(h)}, 0 \le m \le M-1\}$.

Finally, the result of the VAD process is obtained from the two parameters $d_{m,1}$ in Equation (14) and $d_{m,2}$ in Equation (15):

$$d_m = \begin{cases} 1 & \text{if } d_{m,1} = 1 \text{ or } d_{m,2} = 1 \\ 0 & \text{otherwise} \end{cases}, \qquad 0 \le m \le M-1,$$
(16)

where $d_m$ is the final VAD indicator. That is, the $m$th frame is classified as speech if either $d_{m,1}$ or $d_{m,2}$ equals unity. The main reason for using the “or” operation in Equation (16) is that speech frames are likely to be misclassified as non-speech (i.e., a higher false-rejection rate) when we rely on either decision parameter alone, especially as the SNR degrades.
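
The whole of Step I fits in a few lines of NumPy. The sketch below is our own minimal reading of Equations (12)–(16); names such as `mse_vad` and the `1e-12` flooring inside the logarithm are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def high_pass(seq, lam=0.7):
    """First-order IIR high-pass of Equation (12): y[m] = x[m] - lam * y[m-1]."""
    out = np.zeros_like(seq, dtype=float)
    prev = np.zeros(seq.shape[1:]) if seq.ndim > 1 else 0.0
    for m in range(len(seq)):
        out[m] = seq[m] - lam * prev
        prev = out[m]
    return out

def mse_vad(mag_spec, log_energy, lam=0.7):
    """Step I of MSE: the frame-level VAD indicator d_m of Equation (16).

    mag_spec:   (M, K/2+1) array of magnitude spectra |X_m[k]|
    log_energy: (M,) array of log-energies e_m from Equation (11)
    """
    y = high_pass(np.log(mag_spec + 1e-12), lam)  # Eq. (12), bin by bin
    z = y.sum(axis=1)                             # Eq. (13)
    d1 = z >= z.mean()                            # Eq. (14): threshold = stream mean
    e_h = high_pass(log_energy, lam)              # same filter applied to log E
    d2 = e_h >= e_h.mean()                        # Eq. (15)
    return d1 | d2                                # Eq. (16): "or" combination
```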

Step II: Obtain the enhanced magnitude spectrum

This step amplifies the magnitude spectra of the speech frames while diminishing those of the non-speech frames. Its main purpose is to enlarge the magnitude spectral ratio between speech and non-speech frames and thereby reduce the noise effect discussed in Section ‘Effect of additive noise on the linear and logarithmic magnitude spectrum of a speech signal’. The magnitude spectra of the non-speech frames detected in Step I are first collected and then averaged to obtain the estimated noise (magnitude) spectrum for the utterance:

$$N[k] = \frac{\sum_{m=0}^{M-1}(1-d_m)\,|X_m[k]|}{\sum_{m=0}^{M-1}(1-d_m)}, \qquad 0 \le k \le \tfrac{K}{2}.$$
(17)

Note that here, N[k] is independent of the frame index m. Thus, the noise spectrum is estimated once for the utterance.

Next, a weighting factor for each magnitude spectral value $|X_m[k]|$ is defined as follows:

$$w_m[k] = \begin{cases} \left(\dfrac{|X_m[k]|}{N[k]+\delta}\right)^{\alpha} & \text{if } d_m = 1 \\ \varepsilon & \text{if } d_m = 0 \end{cases}, \qquad 0 \le k \le \tfrac{K}{2},\; 0 \le m \le M-1,$$
(18)

where $\alpha$ is a parameter within the range $[0,1]$ that determines the degree of amplification, $\delta$ is a small positive constant that prevents the weighting factor from becoming infinitely large as $N[k] \to 0$, and $\varepsilon$ is a very small positive random variable that significantly reduces the magnitude spectra of the detected non-speech frames.

Thus, the weighting factor for a speech frame ($d_m = 1$) in Equation (18) is related to the SNR as follows:

$$w_m[k] \approx \left(\mathrm{SNR}_m[k]+1\right)^{\alpha/2},$$
(19)

where $\mathrm{SNR}_m[k] = \frac{|X_m[k]|^2}{N^2[k]} - 1$ is the (estimated) SNR at the $k$th frequency bin of the $m$th frame.
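
As a concrete instance, with $\alpha = 0.5$ a frequency bin whose estimated SNR is 15 (about 11.8 dB) receives a weight of $w_m[k] \approx (15+1)^{0.25} = 2$, whereas a bin with an SNR near zero is left almost unchanged ($w_m[k] \approx 1$).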

Finally, the enhanced magnitude spectrum is obtained by multiplying the original magnitude spectrum by the weighting factor $w_m[k]$ in Equation (18):

$$|\tilde{X}_m[k]| = w_m[k]\,|X_m[k]|, \qquad 0 \le k \le \tfrac{K}{2},\; 0 \le m \le M-1.$$
(20)
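
Continuing the sketch started in Step I (again our own illustrative reading, with hypothetical names; the default parameter values follow the experimental section below), Step II can be written as:

```python
import numpy as np

def mse_enhance(mag_spec, d, alpha=0.5, delta=1e-3, eps_max=1e-5, seed=0):
    """Step II of MSE (Equations (17)-(20)): weight the magnitude spectra.

    mag_spec: (M, K/2+1) magnitude spectra |X_m[k]|
    d:        (M,) boolean VAD indicator from Step I; at least one frame
              must be non-speech for the noise estimate of Eq. (17) to exist.
    """
    rng = np.random.default_rng(seed)
    noise = mag_spec[~d].mean(axis=0)                    # Eq. (17): per-bin noise estimate
    w = (mag_spec / (noise + delta)) ** alpha            # Eq. (18): speech-frame weights
    w[~d] = rng.uniform(0.0, eps_max, size=w[~d].shape)  # Eq. (18): non-speech -> tiny epsilon
    return w * mag_spec                                  # Eq. (20)
```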

The proposed MSE has the following properties:

  1. In MSE, the embedded VAD process uses the logarithmic magnitude spectrum rather than the linear magnitude spectrum. According to the discussions in Section ‘Effect of additive noise on the linear and logarithmic magnitude spectrum of a speech signal’, the logarithmic magnitude spectrum is less vulnerable to noise in speech frames, and its temporal-domain sequence exhibits a wider (modulation) spectral bandwidth in speech portions than in non-speech portions. Based on these two characteristics, the logarithmic magnitude spectrum is a more appropriate VAD indicator than the linear magnitude spectrum. The experimental results shown later will reveal that the logarithmic magnitude spectrum outperforms the linear magnitude spectrum in providing MSE with better recognition accuracy.

  2. By assigning different weights to the magnitude spectra of speech and non-speech frames, the speech portions of an utterance are highlighted and the difference between the speech and non-speech portions in the magnitude spectrum is strongly emphasized. This leads to a large magnitude spectral ratio (MSR) as defined in Equation (3) and implies that the effect of noise has been effectively reduced.

  3. The idea of MSE is partially motivated by matched filter theory in the field of communications [38]. For an observed signal $x[n] = s[n] + d[n]$, where $s[n]$ and $d[n]$ are the desired signal and additive noise, respectively, the magnitude (frequency) response of the matched filter that maximizes the output SNR is [39]:

$$|H(\omega)| = \frac{|S(\omega)|}{P_d(\omega)},$$
    (21)

where $|S(\omega)|$ and $P_d(\omega)$ are the magnitude spectrum of $s[n]$ and the power spectral density of the noise $d[n]$, respectively. From Equation (21), $|H(\omega)|$ grows with the input frequency-domain SNR (defined as $|S(\omega)|^2/P_d(\omega)$) when either the signal level $|S(\omega)|^2$ or the noise level $P_d(\omega)$ is held fixed. Thus, MSE shares the idea of the matched filter and uses a spectral weighting factor $w_m[k]$ in Equation (18) that is positively correlated with the SNR. However, MSE differs from the matched filter in several respects. First, MSE uses the magnitude spectrum of the noisy signal $x[n]$ rather than that of the clean signal $s[n]$, which is not available and would require estimation. Second, the magnitude spectrum of the noise $d[n]$ is used, which approximates the square root of the power spectral density of the noise. Finally, MSE additionally detects the non-speech regions and makes the corresponding spectra nearly zero, a non-linear operation that further distinguishes the speech and non-speech frames.

  4. Like the SFN method [35], MSE sets the magnitude spectra of the non-speech portions to small values. However, in the speech portions of the utterance, MSE further amplifies the magnitude spectrum, whereas in SFN the energy-related feature is kept nearly unchanged.

  5. Like the spectral compensation techniques spectral subtraction (SS) [14–16] and Wiener filtering (WF) [18, 19], MSE attempts to reduce the effect of noise in the spectral domain of speech signals. However, the main purpose of SS and WF is to restore the clean spectrum from the noise-corrupted utterance. This contrasts with MSE, where the (magnitude) spectrum of the speech portions is amplified, possibly making the resulting spectrum quite different from the clean spectrum. In general, the updated magnitude spectra produced by SS and WF can be expressed as follows:

$$\text{SS:}\quad |\tilde{X}_m[k]| \approx |X_m[k]|\left(1+\frac{1}{\mathrm{SNR}_m[k]}\right)^{-1/2},$$
(22)

$$\text{WF:}\quad |\tilde{X}_m[k]| \approx |X_m[k]|\left(1+\frac{1}{\mathrm{SNR}_m[k]}\right)^{-1}.$$
(23)

For MSE, the new magnitude spectrum is:

$$\text{MSE:}\quad |\tilde{X}_m[k]| \approx |X_m[k]|\left(\mathrm{SNR}_m[k]+1\right)^{\alpha/2} \quad \text{(for speech frames)}.$$
(24)

In addition, the speech and non-speech portions are treated quite differently in MSE (as shown in Equations (18) and (20)), whereas they are not explicitly treated differently in SS and WF. (A short numerical comparison of the three weighting rules is given after this list.)

  6. In MSE, the VAD procedure used in Step I is quite simple to implement and can be replaced with any other VAD method. In addition, the cepstral features derived from the MSE-processed spectrum can be further compensated using any cepstral-domain robustness technique, such as MVN, MVA, or HEQ, to achieve further improvements in recognition performance, as will be shown in Section ‘Experimental results and discussions’.
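
The contrast between the restoration-style rules of Equations (22)–(23) and the amplification rule of Equation (24) can be seen numerically. The sketch below is our own illustration (with $\alpha = 0.5$, following the exponent convention of Equation (24)); it prints the three gains over a range of per-bin SNRs.

```python
import numpy as np

snr = np.array([0.1, 1.0, 3.0, 15.0, 100.0])  # linear per-bin SNR estimates
alpha = 0.5
gain_ss = (1.0 + 1.0 / snr) ** -0.5           # Eq. (22): spectral subtraction
gain_wf = (1.0 + 1.0 / snr) ** -1.0           # Eq. (23): Wiener filtering
gain_mse = (snr + 1.0) ** (alpha / 2.0)       # Eq. (24): MSE, speech frames only

for s, g1, g2, g3 in zip(snr, gain_ss, gain_wf, gain_mse):
    print(f"SNR = {s:6.1f}   SS = {g1:5.3f}   WF = {g2:5.3f}   MSE = {g3:6.3f}")
# SS and WF always attenuate (gain <= 1), moving the spectrum toward the clean one,
# whereas MSE amplifies speech bins (gain >= 1), widening the speech/non-speech gap.
```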

Experimental results and discussions

We use two experimental environments in this article. In the first, the Aurora-2 connected US-digit database [37] is the platform for evaluating the proposed MSE and various other techniques. It is used to explore the spectrograms of the speech signals processed by MSE and several other spectral-domain processes, to analyze the improvements achievable by each approach, and to compare the different techniques. In the second environment, the NUM-100A continuous Mandarin speech database [40] is used. This database contains microphone-recorded Mandarin digit strings produced by adult Mandarin speakers. We apply the proposed MSE to this data set to further investigate whether MSE remains effective for noisy speech in a different language.

Experiments for the Aurora-2 database

Here, the presented MSE scheme is tested on the AURORA Project Database Version 2.0 (Aurora-2), the details of which are described in [37]. In short, the testing data consist of 4004 utterances from 52 female and 52 male speakers, and three different subsets are defined for the recognition experiments: Test Sets A and B are each affected by four types of noise, and Set C is affected by two types. Each noise instance is added to the clean speech signal at SNR levels ranging from 20 to −5 dB in 5-dB steps. The signals in Test Sets A and B are filtered with a G.712 filter, and those in Set C are filtered with an MIRS filter. G.712 and MIRS are two standard frequency characteristics defined by the ITU [41].

The Aurora-2 task has the following two training modes [37]:

  1. In the first mode, “clean-condition training”, the training data consist of 8440 clean speech utterances from 55 female and 55 male adults.

  2. In the second mode, “multi-condition training”, the clean training data of the first mode are equally split into 20 subsets, to which four different types of noise are added at five different SNR conditions. The four noise types are suburban train, babble, car, and exhibition hall, the same as those in Test Set A. The SNR conditions are clean and 20, 15, 10, and 5 dB.

Therefore, in the first mode, “clean-condition training”, the resulting clean acoustic models contain no information about the possible distortions. This mode helps us evaluate how robust the speech features (and the associated robustness algorithm) are against noise. As for the second mode, “multi-condition training”, the corresponding results can reveal the impact of a noise type or SNR different from those seen during training [37]. In the following experiments and discussions, we primarily focus on the first mode in order to observe how well the presented MSE reduces the effect of noise. However, we also provide the experimental results for the second mode together with relatively brief discussions.

Results for the task of clean-condition training and multi-condition testing

Using the Aurora-2 database in the “clean-condition training” mode, we evaluate the MSE method and a series of robustness methods to compare their recognition accuracy. For the cepstral-domain methods, each utterance in the clean training set and the three testing sets is directly converted to a 13-dimensional MFCC ($c_1$–$c_{12}$, $c_0$) sequence according to the feature settings in [37]. Next, the MFCC features are processed using MVN, MVA, or HEQ. The spectral-domain methods used here include our MSE, spectral subtraction (SS), Wiener filtering (WF), and MMSE-based log-spectral amplitude estimation (MMSE log-STSA). Each utterance is first processed in the linear spectral domain, and the updated spectra are then converted to a sequence of 13-dimensional MFCCs ($c_1$–$c_{12}$, $c_0$). The resulting 13 new features, plus their first- and second-order derivatives, form the final 39-dimensional feature vector. With the new feature vectors in the clean training set, the hidden Markov models (HMMs) for each digit and silence are trained with the demo scripts provided in the Aurora-2 CD set [42]. Each digit HMM has 16 states, with 3 Gaussian mixtures per state.

Detailed information about some of the methods used follows:

  1. We apply the three versions of spectral subtraction (SS) proposed in [14–16]. For clarity, they are denoted by SSBoll, SSBerouti, and SSKamath, respectively, where the subscripts indicate the authors' names.

  2. As with spectral subtraction, the three versions of Wiener filtering (WF) proposed in [18, 19] are tested here. The first is based on a priori signal-to-noise ratio (PSNR) estimation, and the latter two apply a two-step noise reduction (TSNR) procedure and a harmonic regeneration noise reduction (HRNR) scheme, respectively. These methods are thus abbreviated as WFPSNR, WFTSNR, and WFHRNR in later discussions.

  3. For the proposed MSE, the parameter $\delta$ in Equation (18) is set to 0.001, and the positive random variable $\varepsilon$ in Equation (18) is uniformly distributed within the range $(0, 10^{-5})$. To properly select the filter coefficient $\lambda$ in Equation (12) and the exponent $\alpha$ in Equation (18), we use the 8440 noise-corrupted training utterances of the “multi-condition training” mode of the Aurora-2 database as the development set. The averaged recognition accuracy rates for different assignments of $\lambda$ and $\alpha$ (both from 0.1 to 0.9 in steps of 0.2) are shown in Table 1. As a result, we set $\lambda$ and $\alpha$ to 0.7 and 0.5, respectively, since this setting yields the best accuracy on the development set.

  4. For MVA, the order of the ARMA filter is set to 3.

  5. For HEQ, each feature stream in the utterance is normalized to approach a Gaussian distribution with zero mean and unit variance.

Table 1. The averaged recognition accuracy rates (%) on the development set (the multi-condition training data of the Aurora-2 database) achieved by MSE with different filter coefficients $\lambda$ in Equation (12) and exponents $\alpha$ in Equation (18)
Comparison of various noise robustness approaches

Table 2 presents the recognition accuracy rates of the individual Test Sets A, B, and C, averaged over five SNR conditions (0–20 dB at 5-dB intervals), achieved using various approaches. Figure 2 shows the accuracy rates of the spectral-domain methods under different SNR conditions, obtained by averaging over all ten noise types contained in the three Test Sets. Based on Table 2 and Figure 2, we make the following observations:

  1. Compared to baseline processing, most approaches provide significant recognition accuracy improvement in almost all cases. All three SS methods give better results than the baseline for Test Sets A and B, while the improvement for Test Set C is relatively insignificant. A possible explanation of this finding is that SS is particularly designed to alleviate additive noise and thus does not handle the channel mismatch in the utterances of Test Set C very well. On the other hand, WFPSNR performs the best among the three Wiener filtering approaches, while WFTSNR and WFHRNR result in poorer accuracy rates relative to the MFCC baseline. Furthermore, WFPSNR behaves better than SS and is also very helpful with Test Set C. Finally, the method “MMSE log-STSA” performs quite well, and its corresponding averaged recognition accuracy is slightly better than that of WFPSNR.

  2. Among the spectral-domain methods studied, the proposed MSE method outperforms MMSE log-STSA and the various versions of SS and WF in almost all cases. Furthermore, MSE leads to a relative error reduction rate of 49.82% for additive-noise conditions (Test Sets A and B) and 42.72% for all conditions (Test Sets A, B, and C) compared with baseline results. The results show that MSE effectively enhances the robustness of MFCC in various noise-corrupted environments.

  3. The proposed MSE method provides very promising recognition accuracy rates for all SNR conditions. In particular, MSE outperforms WFPSNR and MMSE log-STSA for higher SNR cases (20 and 15 dB), and the three methods deliver very similar accuracy rates for lower SNR cases.

  4. Among the three cepstral-domain methods, HEQ performs the best, followed by MVA and then MVN. In addition, the three cepstral-domain methods perform better than most spectral-domain methods, with the exception that MVN performs worse than MSE for Test Sets A and B. This finding leads to the concept of integrating these cepstral-domain methods with the proposed MSE as discussed below. It will be shown that such integration can offer further improvements in performance.

  5. In order to examine whether the presented MSE yields a statistically significant improvement in recognition accuracy relative to the other methods, the one-proportion z-test [43] is performed as follows. Let $p$ and $p_0$ denote the accuracy rates provided by MSE and the method under comparison, respectively. We set the null hypothesis $H_0: p = p_0$ against the alternative hypothesis $H_1: p > p_0$, and the test statistic is:

$$z = \frac{p - p_0}{\sqrt{p_0(1-p_0)/N}},$$
    (25)

where $N$ is the number of words in the test; here $N = 214465$ for the Aurora-2 evaluation task [37]. If the test statistic $z$ in Equation (25) is larger than about 2.326, then the null hypothesis $H_0$ is rejected and the improvement is statistically significant at a confidence level of 99% (since $\int_{2.326}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du \approx 1\% = 1 - 99\%$). According to the obtained test statistics, the improvement brought by MSE relative to the other spectral-domain methods is statistically significant. For example, when the method under comparison is MMSE log-STSA, the corresponding test statistic is 41.99, far larger than the threshold of 2.326.
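
The test statistic of Equation (25) is straightforward to reproduce. In the sketch below (our own; `p0` is an assumed placeholder value rather than a published figure, while `p` is the averaged MSE accuracy reported later in this section):

```python
from math import sqrt

def one_proportion_z(p, p0, n):
    """Test statistic of Equation (25) for H0: p = p0 against H1: p > p0."""
    return (p - p0) / sqrt(p0 * (1.0 - p0) / n)

N = 214465  # number of test words in the Aurora-2 evaluation task
# p = 0.7694 is MSE's averaged accuracy (Table 2); p0 = 0.73 is only an assumed
# accuracy for the competing method, used here for illustration.
z = one_proportion_z(p=0.7694, p0=0.73, n=N)
print(f"z = {z:.2f}  ->  significant at the 99% level if z > 2.326")
```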

Table 2. Recognition accuracy (%) achieved by various approaches for the Aurora-2 clean-condition training task, averaged across SNRs between 0 and 20 dB, where AVG (%) and RR (%) are the averaged accuracy rate and the relative error rate reduction over the baseline
Figure 2. Recognition accuracy (%) achieved by various spectral-domain methods under different SNR conditions, averaged over all noise types in the three Test Sets, for the Aurora-2 clean-condition training task.

In addition to recognition accuracy, we also examine the various spectral-domain methods' capabilities of reducing the spectrogram mismatch caused by additive noise. Figures 3, 4, 5, 6, 7, 8, 9, and 10 show the spectrograms of a digit utterance (“FLJ_97159A.08” in the Aurora-2 database) at two SNR levels, clean and 5 dB (with babble noise), obtained by SSBoll, SSBerouti, SSKamath, WFPSNR, WFTSNR, WFHRNR, MMSE log-STSA, and the proposed MSE, respectively. First, the figures show that in the clean case, the voiced portions and the short pauses between any two consecutive digits or syllables are clearly revealed by almost all approaches. Second, in the noise-corrupted case, WFPSNR, MMSE log-STSA, and MSE highlight the short pauses more than the other approaches, and they preserve the voiced segments with less distortion (especially in the region [0.7 s, 1.3 s]). The similar treatment of these short pauses under clean and noise-corrupted conditions by these three methods may therefore result in a relatively small mismatch between the two SNR conditions, which helps explain their higher recognition accuracy shown previously. Finally, the detected speech segments are quite clearly separated in the MSE-processed spectrogram, which may be one reason why MSE performs so well.

Figure 3. The SSBoll-processed spectrogram of an utterance under two SNR levels: (a) clean, (b) 5 dB.

Figure 4. The SSBerouti-processed spectrogram of an utterance under two SNR levels: (a) clean, (b) 5 dB.

Figure 5. The SSKamath-processed spectrogram of an utterance under two SNR levels: (a) clean, (b) 5 dB.

Figure 6. The WFPSNR-processed spectrogram of an utterance under two SNR levels: (a) clean, (b) 5 dB.

Figure 7. The WFTSNR-processed spectrogram of an utterance under two SNR levels: (a) clean, (b) 5 dB.

Figure 8. The WFHRNR-processed spectrogram of an utterance under two SNR levels: (a) clean, (b) 5 dB.

Figure 9. The MMSE log-STSA-processed spectrogram of an utterance under two SNR levels: (a) clean, (b) 5 dB.

Figure 10. The MSE-processed spectrogram of an utterance under two SNR levels: (a) clean, (b) 5 dB.

Integration of MSE with cepstral feature processing techniques

MSE, which operates in the spectral domain, can be easily integrated with cepstral-domain processing techniques. Here, we test whether such integration further improves recognition performance. MFCC features are first derived from the MSE-processed spectra and then processed using MVN, HEQ, or MVA. For a more complete comparison, we also integrate each of the spectral-domain methods SSBerouti, WFPSNR, and MMSE log-STSA with each cepstral-domain method. The corresponding recognition results are shown in Table 3. For comparison purposes, the accuracy rates for MSE, SSBerouti, WFPSNR, MMSE log-STSA, MVN, HEQ, and MVA are relisted from Table 2. Several findings emerge from Table 3:

  1. The combination of MSE and a cepstral-domain method produces better results than the individual component methods in most cases. For example, MSE plus MVA (82.37%) is better than MSE (76.94%) and MVA (78.75%) in recognition accuracy averaged over the ten noise types of the three Test Sets, and it results in a relative error reduction rate of 56.20%. Similar results are achieved with MSE plus MVN and MSE plus HEQ. These results clearly indicate that MSE can be successfully combined with cepstral-domain approaches to further improve noise robustness.

  2. For the channel-distorted signals in Test Set C, MSE performs worse than the cepstral-domain methods alone. However, combining MSE with either MVN or MVA yields better recognition rates for Test Set C. For example, MSE plus MVA (80.58%) is better than MVA alone (79.12%) in averaged recognition accuracy. Therefore, MSE enhances MVN and MVA in processing channel-distorted signals even though it is primarily designed for additive-noise conditions.

  3. Unlike MSE, combining any of the three spectral-domain methods SSBerouti, WFPSNR, and MMSE log-STSA with a cepstral-domain method performs worse than the cepstral-domain method alone. For example, MMSE log-STSA plus HEQ achieves an averaged accuracy of 79.58%, lower than the 82.21% obtained by HEQ alone. These results again imply that the presented MSE outperforms the other three spectral-domain methods used here.

Table 3 Recognition accuracy (%) achieved by various approaches for Aurora-2 clean-condition training task averaged across the SNRs between 0 and 20 dB, where AVG (%) and RR (%) are the averaged accuracy rate and the relative error rate reduction over the baseline
The influence of VAD errors on MSE in speech recognition

In this section, we first investigate the effect of VAD errors on the recognition performance of MSE. For this purpose, we perform MSE under an “oracle condition”: the VAD results obtained from each clean utterance are directly applied to its various noise-corrupted counterparts when implementing the magnitude spectrum enhancement. This process is referred to as “MSE(o)” here. Assuming that the VAD error of MSE for a clean utterance is small and negligible, the recognition accuracy difference between MSE(o) and MSE for noise-corrupted utterances can be viewed as a consequence of the VAD errors caused by noise.

The recognition accuracy rates for MSE(o) and MSE are listed in Table 4. As expected, MSE(o) always performs better than MSE because it contains no VAD errors. However, the difference in accuracy is not very significant. In the worst case (SNR = 0 dB), the performance degradation is 4.96% (1.64% for Set A, 8.77% for Set B, and 4.00% for Set C), and on average it is 2.90% (1.57% for Set A, 3.92% for Set B, and 3.49% for Set C). These results indicate that the performance of MSE is only somewhat influenced by the error of the embedded VAD process.

Table 4. Recognition accuracy (%) achieved by MSE(o) and MSE for the Aurora-2 clean-condition training task, where MSE(o) is MSE employing nearly error-free VAD results (MSE in the oracle condition)

Next, we select different VAD indicators for MSE to see the corresponding effect. According to the analysis in Section ‘Effect of additive noise on the logarithmic magnitude spectrum in the frame sequences’, the high-pass-filtered logarithmic magnitude spectrum (logMS) and logarithmic energy (log E) can emphasize the difference between speech and non-speech frames, and thus they were chosen as the VAD indicators of MSE. Here, we examine the following two alternatives as VAD indicators:

  1. the original linear magnitude spectrum ($|X_m[k]|$ in Equation (10)) and the linear energy (the exponential of $e_m$ in Equation (11)),

  2. the high-pass-filtered linear magnitude spectrum and the high-pass-filtered energy,

and the corresponding two MSE processes are denoted by MSE(L1) and MSE(L2), respectively, for simplicity. Figure 11 shows the recognition accuracy rates for MSE(L1) and MSE(L2) under different SNR conditions for the three Test Sets; the results of the original MSE are added to this figure for comparison. From this figure, we find that when the SNR is high (clean and 20 dB), there is no substantial performance difference among the three MSE methods. However, as the noise level increases, the original MSE significantly outperforms the two variants MSE(L1) and MSE(L2). Consequently, compared with the linear magnitude spectrum and energy, the high-pass-filtered logarithmic magnitude spectrum (as well as the logarithmic energy) provides more accurate VAD under noisy conditions and achieves better recognition results for the subsequent MSE processing.

Figure 11. Recognition accuracy (%) (averaged over all ten noise types in the three Test Sets) achieved by MSE, MSE(L1), and MSE(L2) for different SNR conditions.

Further issues regarding MSE processing

Several issues relating to the proposed MSE scheme are investigated further in this section.

The effect of the exponent α in MSE

One of the central ideas of MSE is to amplify the spectral magnitudes of speech frames; from Equations (18) and (24), the amplification factor (for a speech frame) is

$$w_m[k] = \left(\frac{|X_m[k]|}{N[k]+\delta}\right)^{\alpha} \approx \left(\mathrm{SNR}_m[k]+1\right)^{\alpha/2}.$$
(26)

As Equation (26) shows, the exponent $\alpha$ controls the degree of amplification. Increasing $\alpha$ enlarges the difference between the speech and non-speech frames in the magnitude spectrum, but it may also lead to a greater mismatch among the speech frames of the same syllable or phoneme under different SNR conditions. As a result, a larger $\alpha$ in MSE does not always bring about improved recognition accuracy, even if the VAD contains no errors. Here, we assign the exponent $\alpha$ different values within the range $[0, 1]$ and then perform MSE to investigate the corresponding recognition accuracy.

Figure 12 shows the recognition results averaged over five SNR conditions (0–20 dB) and all ten noise types in the three Test Sets for different values of $\alpha$ in MSE (the filter coefficient $\lambda$ in Equation (12) is fixed at 0.7). As shown in Figure 12, we find that

  1. The case $\alpha = 0$, where the magnitude spectra of the speech frames are kept unchanged in MSE, yields an averaged recognition accuracy of 72.26%, significantly better than the MFCC baseline result (59.75%). This shows that simply setting the magnitude spectra of the detected non-speech frames to nearly zero is beneficial to recognition performance.

  2. The recognition accuracy improves as $\alpha$ is increased from 0 to 0.6, with an additional accuracy gain of 4.80% (from 72.26% to 77.06%). Therefore, properly amplifying the magnitude spectra of the speech frames is helpful.

  3. When the exponent $\alpha$ is further increased from 0.6 to 1, the recognition rates worsen, possibly due to the enlarged mismatch among the speech frames mentioned previously. However, the drop from the maximum accuracy is just 0.62% (from 77.06% at $\alpha = 0.6$ to 76.44% at $\alpha = 0.8$), implying that the recognition accuracy is relatively insensitive to $\alpha$ (provided that $\alpha$ is within the range [0.6, 1]).

Figure 12. Recognition accuracy (%) (averaged over five SNR values and all ten noise types in the three Test Sets) versus different assignments of the exponent $\alpha$ in MSE.

The effect of the filter coefficient λ in MSE

As stated in Section ‘The magnitude spectrum enhancement (MSE) approach’, the filter coefficient $\lambda$ in Equation (12) determines the frequency response of the high-pass filter used in the VAD process of MSE. The case $\lambda = 0$ corresponds to using the logarithmic magnitude spectrum (logMS) and the log energy (log E) directly as the VAD features, whereas increasing values of $\lambda$ further reduce the lower modulation frequency components and emphasize the higher ones in the logMS and log E streams, as shown in Figure 1. This parameter was preliminarily set to 0.7 in the previous experiments. Now, we vary its value from 0 to 0.9 in steps of 0.1 and perform the corresponding MSE. (Note that setting $\lambda = 1$ results in an unstable filter.)

Figure 13 shows the recognition results averaged over five SNR conditions (0–20 dB) and all ten noise types in the three Test Sets using different values of $\lambda$ for MSE (the exponent $\alpha$ in Equation (26) is fixed at 0.5). We first find that applying MSE with a positive $\lambda$ achieves better results than applying MSE with $\lambda = 0$ in most cases, indicating that emphasizing the higher modulation frequency components enhances the VAD of MSE. Next, setting $\lambda$ to 0.8 yields the best accuracy rate (77.04%), 0.10% better than the accuracy obtained with $\lambda = 0.7$ (76.94%). Finally, when $\lambda$ is within the range [0.1, 0.9], the differences among the accuracy rates obtained with different values of $\lambda$ are relatively small, and the largest drop from the best accuracy is just 1.82%. This result implies that nearly optimal performance can be obtained without meticulous adjustment of the parameter $\lambda$.

Figure 13. Recognition accuracy (%) (averaged over five SNR values and all ten noise types in the three Test Sets) versus different assignments of the filter coefficient $\lambda$ in MSE.

The effect of processing the short pauses within the utterance in MSE

In the VAD procedure of MSE, each frame in an utterance is always classified as either speech or non-speech. Therefore, no frame is classified as a “transient frame”, as is the case in some more delicate VAD processes. In fact, the transient frames that exist in the short region between two connected acoustic units (often called “short pauses”) are quite often classified as non-speech in MSE, and thus their magnitude spectra are set to very small values. For this reason, the VAD in MSE is unlike some conventional end-point detectors, in which only the onset and offset frames of an utterance are determined, while the inter-word or inter-syllable frames that often possess lower energy are left unprocessed. However, we find that further processing the detected short pauses between the onset and offset times of an utterance is quite helpful for speech recognition, especially when the SNR is low. To demonstrate this phenomenon, a simpler form of MSE is designed in which we only process the first and last detected non-speech segments (the corresponding frames are assigned very small magnitude spectra) and treat the remaining non-speech segments as speech (the magnitude spectra of the corresponding frames are weighted as in Equation (24)). This method is called “MSE(s)” here for simplicity, and we compare its speech recognition performance with that of the original MSE.

Figure 14 shows the recognition accuracy rates for MSE(s) under different SNR conditions for the three Test Sets. In this figure, we see that almost no performance difference exists between MSE(s) and MSE in the clean condition. However, when noise is present, MSE(s) always performs worse than MSE, and the performance difference becomes more significant as the SNR decreases. On average, MSE(s) is around 4% less effective in recognition accuracy than MSE. In general, in acoustic model training a short-pause model is trained to aid word or syllable boundary determination and thus to improve recognition accuracy. However, under noise-corrupted conditions the short-pause model becomes less helpful, as shown by the MSE(s) results. Furthermore, in MSE, further classifying the transient frames within the voice-activated region of an utterance as non-speech (so that the corresponding magnitude spectra become very small) significantly improves the recognition accuracy in noise-corrupted environments.

Figure 14. Recognition accuracy (%) (averaged over all ten noise types in the three Test Sets) achieved by MSE and MSE(s) for different SNR conditions.

Results for the task of multi-condition training and multi-condition testing

We apply the MSE method, SSBerouti (the best of the three SS methods in Table 2), WFPSNR (the best of the three WF methods in Table 2), and the three aforementioned cepstral-domain methods to the Aurora-2 database under the “multi-condition training” mode. As stated earlier, the training data here cover five SNR conditions (clean, 20, 15, 10, and 5 dB) and four types of noise, the same as those in Test Set A. In addition to the individual methods, we also investigate pairing each spectral-domain method with each cepstral-domain method to see whether further accuracy improvements can be achieved. Table 5 presents the recognition accuracy rates of the individual Test Sets A, B, and C, averaged over five SNR conditions, achieved by the various methods. We have the following findings from Table 5:

  1. For the spectral-domain methods, SSBerouti and WFPSNR degrade the accuracy of the MFCC baseline. The proposed MSE provides MFCC with an accuracy improvement of around 1% for Test Sets A and B (additive-noise environments), but it still worsens the recognition accuracy for Test Set C (with both additive noise and channel distortion). A possible explanation is that these spectral-domain methods introduce distortions and weaken the discriminative components of the speech features while alleviating the noise effect in the multi-condition training data.

  2. In contrast with the spectral-domain methods, the three cepstral-domain methods give significant performance improvements over the MFCC baseline. MVA performs the best, followed by MVN and then HEQ. We find that MVN outperforms HEQ slightly, which is not the case for the clean-condition training mode shown in Table 2. This is probably because the mismatch between the training and testing data is relatively small in the multi-condition training mode, so an over-normalization problem may occur in HEQ, resulting in worse accuracy relative to MVN.

  3. None of the three spectral-domain methods, SSBerouti, WFPSNR, and MSE, helps the subsequent cepstral-domain method provide better recognition accuracy rates in comparison with the cepstral-domain method alone. These results again imply that these spectral-domain methods very probably diminish the helpful speech components in the noisy training data and are inappropriate for the multi-condition training task.

Table 5. Recognition accuracy (%) achieved by various approaches for the Aurora-2 multi-condition training task, averaged across SNRs between 0 and 20 dB, where AVG (%) and RR (%) are the averaged accuracy rate and the relative error rate reduction over the baseline

Experiments for the NUM-100A database

Besides the Aurora-2 database, we adopt another database, NUM-100A [40], to test the performance of the presented MSE. The NUM-100A database consists of 8000 Mandarin digit strings produced by 50 male and 50 female speakers, recorded in a normal laboratory environment at an 8 kHz sampling rate. These 8000 digit strings include 1000 each of two-, three-, four-, five-, six-, and seven-digit strings, plus 2000 single-digit utterances. Among the 8000 Mandarin digit strings, 7520 of different lengths are selected for training, while the other 480 are used for testing. In particular, four types of noise (white, babble, pink, and f16) taken from the NOISEX-92 database [44] are added to the 480 clean testing strings at four different SNRs (20, 15, 10, and 5 dB) to produce the noise-corrupted testing data. The speech features used here are the same as those in the Aurora-2 task: 13 MFCCs ($c_1$–$c_{12}$, $c_0$) plus their delta and delta-delta coefficients. With the feature vectors in the training set, the HMMs for each of the 10 digits and silence are trained with the HTK toolkit [45]. Each digit HMM contains five states with eight Gaussian mixtures per state, and the silence HMM has three states with eight mixtures per state.

For simplicity, we use MSE with the same parameter settings as in the Aurora-2 task to process the training and testing signals and to create the corresponding MFCC features. In addition, since we only intend to investigate whether MSE also improves noisy speech recognition for a database other than Aurora-2, we do not test the other spectral-domain methods, such as SS and WF, and simply choose one cepstral-domain method, MVN, to process the MFCC features.

Figures 15 and 16a–d show the recognition accuracy rates for the four methods, MFCC baseline, MSE, MVN and the pairing of MSE and MVN, under the clean and four noise-corrupted situations with different SNRs. From these figures, we have the following findings:

  1. Under the clean, matched condition, both MSE and MVN degrade the recognition rate of MFCC slightly, and the combination of MSE and MVN produces the worst results. These results imply that robustness methods can reduce the discriminability of the original features when the environment is noise-free.

  2. The recognition accuracy of the original MFCC degrades substantially under mismatched noisy conditions. However, the presented MSE enhances MFCC and brings about significant accuracy improvements irrespective of the type of noise. For example, at an SNR of 10 dB, MSE improves the MFCC accuracy rates by 20.09%, 53.42%, 26.77%, and 41.74% for white, babble, pink, and f16 noise, respectively. Therefore, MSE works well as a noise-robustness approach for this Mandarin digit database in addition to Aurora-2.

  3. MVN improves the recognition accuracy substantially relative to the MFCC baseline when the environment is noisy, and it outperforms MSE in most cases. However, the cascade of MSE and MVN performs better than MVN alone (except for the babble and f16 noises at an SNR of 20 dB), showing again that MSE combines well with the cepstral-domain method MVN.

Figure 15. Recognition accuracy (%) achieved by various approaches for the NUM-100A database under the noise-free environment.

Figure 16. Recognition accuracy (%) achieved by various approaches for the NUM-100A database under environments with additive (a) white, (b) babble, (c) pink, and (d) f16 noise, at four SNR levels.

Conclusions

In this article, we investigate the effect of additive noise on the linear and logarithmic spectra of noise-corrupted utterances and provide a compensation scheme, called magnitude spectrum enhancement (MSE), to enhance the noise robustness of speech features. MSE aims to shrink the magnitude spectra in the silence portions of an utterance and to strengthen them in the speech portions. Experimental results show that MSE is very effective in improving recognition performance under various noise conditions for the Aurora-2 clean-condition training task, and it outperforms spectral subtraction and Wiener filtering. Furthermore, MSE can be successfully combined with cepstral-domain methods to deliver even better recognition rates.

Appendix 1

Given that $A = |A|e^{j\theta}$ is a complex-valued constant and $N = N_R + jN_I$ is a complex-valued random variable whose real and imaginary parts, $N_R$ and $N_I$, are independent Gaussian random variables with zero mean and common variance $\sigma^2$, it can be shown that [38]:

  1. The random variable $|A+N|$ is Rician distributed, and its probability density function (pdf) is

$$f_{|A+N|}(x) = \frac{x}{\sigma^2}\exp\!\left(-\frac{x^2+|A|^2}{2\sigma^2}\right)I_0\!\left(\frac{|A|x}{\sigma^2}\right)u(x),$$
    (27)

    where I 0(.) is the modified Bessel function of the first kind with order zero, and u(.) is the unit-step function.

  2. The random variable $|N|$ is Rayleigh distributed, with pdf

     $$f_{|N|}(x)=\frac{x}{\sigma^2}\exp\!\left(-\frac{x^2}{2\sigma^2}\right)u(x).$$
     (28)

Therefore, the terms $|S_p[k]+D_p[k]|$ and $|D_q[k]|$ in Equation (3) are Rician and Rayleigh distributed, respectively. Furthermore, assuming $D_p[k]$ and $D_q[k]$ are statistically independent (since they correspond to different frames) and identically distributed, Equation (3) becomes

$$\begin{aligned}
\gamma[k] &= E\left\{\frac{\left|S_p[k]+D_p[k]\right|}{\left|D_q[k]\right|}\right\}
= E\left\{\frac{1}{\left|D_q[k]\right|}\right\}E\left\{\left|S_p[k]+D_p[k]\right|\right\}\\
&= \left(\int_0^{\infty}\frac{1}{x}\cdot\frac{x}{\sigma^2}\exp\!\left(-\frac{x^2}{2\sigma^2}\right)dx\right)\times
\sigma\sqrt{\frac{\pi}{2}}\;{}_1F_1\!\left(-\frac{1}{2};1;-\frac{|S_p[k]|^2}{2\sigma^2}\right)\\
&= \sqrt{\frac{\pi}{2}}\,\frac{1}{\sigma}\cdot\sigma\sqrt{\frac{\pi}{2}}\;{}_1F_1\!\left(-\frac{1}{2};1;-\frac{|S_p[k]|^2}{2\sigma^2}\right)
= \frac{\pi}{2}\,{}_1F_1\!\left(-\frac{1}{2};1;-\frac{|S_p[k]|^2}{2\sigma^2}\right),
\end{aligned}$$
(29)

where ${}_1F_1(\cdot\,;\cdot\,;\cdot)$ is the confluent hypergeometric function, which here satisfies [38]:

$${}_1F_1\!\left(-\frac{1}{2};1;-x\right)=\exp\!\left(-\frac{x}{2}\right)\left[(1+x)\,I_0\!\left(\frac{x}{2}\right)+x\,I_1\!\left(\frac{x}{2}\right)\right],$$
(30)

in which $I_1(\cdot)$ is the modified Bessel function of the first kind of order one.

It can be shown that

$${}_1F_1\!\left(-\frac{1}{2};1;-x\right)>0 \quad \text{for } x>0,$$
(31)

and since $\frac{d}{dx}I_0(x)=I_1(x)$ and $\frac{d}{dx}\!\left(xI_1(x)\right)=xI_0(x)$, we have

$$\frac{d}{dx}\,{}_1F_1\!\left(-\frac{1}{2};1;-x\right)=\frac{1}{2}\exp\!\left(-\frac{x}{2}\right)\left[I_0\!\left(\frac{x}{2}\right)+I_1\!\left(\frac{x}{2}\right)\right]>0 \quad \text{for } x>0.$$
(32)

Therefore, ${}_1F_1\!\left(-\frac{1}{2};1;-x\right)$ is a positive and monotonically increasing function for $x>0$, and we conclude that the parameter $\gamma[k]=\frac{\pi}{2}\,{}_1F_1\!\left(-\frac{1}{2};1;-\frac{|S_p[k]|^2}{2\sigma^2}\right)$ in Equation (29) decreases as the noise variance $\sigma^2$ increases (since the argument $\frac{|S_p[k]|^2}{2\sigma^2}$ then decreases), with the two limiting cases $\lim_{\sigma^2\to 0}\gamma[k]=\infty$ and $\lim_{\sigma^2\to\infty}\gamma[k]=\frac{\pi}{2}$.
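Equation (29) is also easy to check numerically: a Monte Carlo estimate of $\gamma[k]$ from simulated complex Gaussian noise should match the closed form $\frac{\pi}{2}\,{}_1F_1(-\frac{1}{2};1;-\frac{|S_p[k]|^2}{2\sigma^2})$. A sketch using SciPy, with arbitrary test values for $|S_p[k]|$ and $\sigma$, is:

```python
import numpy as np
from scipy.special import hyp1f1

rng = np.random.default_rng(0)
S, sigma, n = 1.5, 0.8, 2_000_000   # arbitrary test values for |S_p[k]| and sigma

# Numerator: |S + D_p| with D_p a zero-mean complex Gaussian (Rician magnitude).
d_p = rng.normal(0.0, sigma, n) + 1j * rng.normal(0.0, sigma, n)
# Denominator: |D_q| with an independent complex Gaussian (Rayleigh magnitude).
d_q = rng.normal(0.0, sigma, n) + 1j * rng.normal(0.0, sigma, n)

gamma_mc = np.mean(np.abs(S + d_p) / np.abs(d_q))                  # Monte Carlo
gamma_cf = (np.pi / 2) * hyp1f1(-0.5, 1, -S**2 / (2 * sigma**2))   # Equation (29)

print(f"Monte Carlo: {gamma_mc:.3f}  closed form: {gamma_cf:.3f}")
```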

References

  1. Holmes JN, Sedgwick NC: Noise compensation for speech recognition using probabilistic models. 1986 International Conference on Acoustics, Speech and Signal Processing (ICASSP’86), vol. 11 (Tokyo, Japan, 1986), pp. 741–744

  2. Klatt DH: A digital filter bank for spectral matching. 1976 International Conference on Acoustics, Speech and Signal Processing (ICASSP’76), vol. 1 (Philadelphia, USA, 1976), pp. 573–576

  3. Nadas A, Nahamoo D, Picheny M: Speech recognition using noise-adaptive prototypes. 1988 International Conference on Acoustics, Speech and Signal Processing (ICASSP’88), vol. 1 (New York, USA, 1988), pp. 517–520

  4. Varga AP, Moore RK: Hidden Markov model decomposition of speech and noise. 1990 International Conference on Acoustics, Speech and Signal Processing (ICASSP’90), vol. 2 (Albuquerque, USA, 1990), pp. 845–848

  5. Acero A, Deng L, Kristjansson T, Zhang J: HMM adaptation using vector Taylor series for noisy speech recognition. 2000 International Conference on Spoken Language Processing (ICSLP’00), vol. 3 (Beijing, China, 2000), pp. 869–872

  6. Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density HMMs. Comput. Speech Lang. 1995, 9: 171–186. 10.1006/csla.1995.0010

  7. Sankar A, Lee C-H: A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Trans. Speech Audio Process. 1996, 4: 190–202. 10.1109/89.496215

  8. Lee C-H: On stochastic feature and model compensation approaches to robust speech recognition. Speech Commun. 1998, 25: 29–47. 10.1016/S0167-6393(98)00028-4

  9. Ning G-X, Wei G, Chu K-K: Model compensation approach based on nonuniform spectral compression features for noisy speech recognition. EURASIP J. Adv. Signal Process. 2007 (2007)

  10. Moreno PJ, Raj B, Stern RM: Data-driven environmental compensation for speech recognition: a unified approach. Speech Commun. 1998, 24: 267–285. 10.1016/S0167-6393(98)00025-9

  11. Gales MJF, Young SJ: Cepstral parameter compensation for HMM recognition in noise. Speech Commun. 1993, 12: 231–239. 10.1016/0167-6393(93)90093-Z

  12. Gales MJF, Young SJ: Robust speech recognition in additive and convolutional noise using parallel model combination. Comput. Speech Lang. 1995, 9: 289–307. 10.1006/csla.1995.0014

  13. Gales MJF, Young SJ: A fast and flexible implementation of parallel model combination. 1995 International Conference on Acoustics, Speech and Signal Processing (ICASSP’95), vol. 1 (Detroit, USA, 1995), pp. 133–136

  14. Boll SF: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27: 113–120. 10.1109/TASSP.1979.1163209

  15. Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. 1979 International Conference on Acoustics, Speech and Signal Processing (ICASSP’79), vol. 4 (Washington, USA, 1979), pp. 208–211

  16. Kamath S, Loizou P: A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. 2002 International Conference on Acoustics, Speech and Signal Processing (ICASSP’02), vol. 4 (Orlando, USA, 2002), pp. IV–4164

  17. BabaAli B, Sameti H, Safayani M: Likelihood maximizing based multi-band spectral subtraction for robust speech recognition. EURASIP J. Adv. Signal Process. 2009 (2009)

  18. Scalart P, Filho JV: Speech enhancement based on a priori signal to noise estimation. 1996 International Conference on Acoustics, Speech and Signal Processing (ICASSP’96), vol. 2 (Atlanta, USA, 1996), pp. 629–632

  19. Plapous C, Marro C, Scalart P: Improved signal-to-noise ratio estimation for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 2006, 14: 2098–2108

  20. Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1984, 32: 1109–1121. 10.1109/TASSP.1984.1164453

  21. Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33: 443–445. 10.1109/TASSP.1985.1164550

  22. Acero A: Acoustical and environmental robustness in automatic speech recognition. Ph.D. dissertation, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA (1990)

  23. Chu KK, Leung SH: SNR-dependent non-uniform spectral compression for noisy speech recognition. 2004 International Conference on Acoustics, Speech and Signal Processing (ICASSP’04), vol. 1 (Montreal, Canada, 2004), pp. 973–976

  24. Deng L, Acero A, Jiang L, Droppo J, Huang X: High-performance robust speech recognition using stereo training data. 2001 International Conference on Acoustics, Speech and Signal Processing (ICASSP’01), vol. 1 (Salt Lake City, USA, 2001), pp. 301–304

  25. Droppo J, Deng L, Acero A: Evaluation of the SPLICE algorithm on the Aurora2 database. 2001 European Conference on Speech Communication and Technology (Eurospeech’01), vol. 1 (Aalborg, Denmark, 2001), pp. 185–188

  26. Atal BS: Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am. 1974, 55: 1304–1312. 10.1121/1.1914702

  27. Tibrewala S, Hermansky H: Multiband and adaptation approaches to robust speech recognition. 1997 European Conference on Speech Communication and Technology (Eurospeech’97), vol. 1 (Rhodes, Greece, 1997), pp. 2619–2622

  28. Chen C-P, Bilmes JA: MVA processing of speech features. IEEE Trans. Audio Speech Lang. Process. 2007, 15: 257–270

  29. Yoshizawa S, Hayasaka N, Wada N, Miyanaga Y: Cepstral gain normalization for noise robust speech recognition. 2004 International Conference on Acoustics, Speech and Signal Processing (ICASSP’04), vol. 1 (Montreal, Canada, 2004), pp. 209–212

  30. Hilger F, Ney H: Quantile based histogram equalization for noise robust large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 2006, 14: 845–854

  31. Suh Y, Kim H: Histogram equalization to model adaptation for robust speech recognition. EURASIP J. Adv. Signal Process. 2010 (2010)

  32. Du J, Wang R-H: Cepstral shape normalization (CSN) for robust speech recognition. 2008 International Conference on Acoustics, Speech and Signal Processing (ICASSP’08), vol. 1 (Las Vegas, USA, 2008), pp. 4389–4392

  33. Zhu W, O’Shaughnessy D: Log-energy dynamic range normalization for robust speech recognition. 2005 International Conference on Acoustics, Speech and Signal Processing (ICASSP’05), vol. 1 (Philadelphia, USA, 2005), pp. 245–248

  34. Hwang T-H, Chang S-C: Energy contour enhancement for noisy speech recognition. 2004 International Symposium on Chinese Spoken Language Processing (ISCSLP’04), vol. 1 (Hong Kong, China, 2004), pp. 249–252

  35. Wang C-C, Pan C-A, Hung J-W: Silence feature normalization for robust speech recognition in additive noise environments. 2008 International Conference on Spoken Language Processing (Interspeech 2008-ICSLP), vol. 1 (Brisbane, Australia, 2008), pp. 1028–1031

  36. Tu W-H, Hung J-W: Magnitude spectrum enhancement for robust speech recognition. 2010 International Conference on Acoustics, Speech and Signal Processing (ICASSP’10), vol. 1 (Dallas, USA, 2010), pp. 4586–4589

  37. Hirsch HG, Pearce D: The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions. ISCA ITRW ASR2000 “Automatic Speech Recognition: Challenges for the Next Millennium”, vol. 1 (Paris, France, 2000), pp. 181–188

  38. Haykin S: Communication Systems. (John Wiley & Sons, Inc., New York, 2000)

  39. Turin GL: An introduction to matched filters. IRE Trans. Inf. Theory 1960, 6: 311–329. 10.1109/TIT.1960.1057571

  40. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP). Available from: http://www.aclclp.org.tw

  41. ITU recommendation G.712: Transmission performance characteristics of pulse code modulation channels. 1996

  42. Evaluations and Language resources Distribution Agency (ELDA). Available from: http://www.elda.org/article52.html

  43. Hogg RV, Tanis EA: Probability and Statistical Inference. (Prentice Hall, Upper Saddle River, NJ, 2006)

  44. Varga AP, Steeneken HJM, Tomlinson M, Jones D: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Tech. Rep., DRA Speech Research Unit, 1992

  45. The Hidden Markov Model Toolkit (HTK), Cambridge University Engineering Department (CUED). Available from: http://htk.eng.cam.ac.uk/

Author information

Correspondence to Jeih-weih Hung.

Additional information

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Hung, J.-w., Fan, H.-t. & Tu, W.-h. Enhancing the magnitude spectrum of speech features for robust speech recognition. EURASIP J. Adv. Signal Process. 2012, 189 (2012). https://doi.org/10.1186/1687-6180-2012-189
