Dereverberation and denoising based on generalized spectral subtraction by multi-channel LMS algorithm using a small-scale microphone array
© Wang et al; licensee Springer. 2012
Received: 15 June 2011
Accepted: 17 January 2012
Published: 17 January 2012
A blind dereverberation method based on power spectral subtraction (SS) using a multi-channel least mean squares algorithm was previously proposed to suppress reverberation in speech without additive noise. Isolated word speech recognition experiments showed that this method achieved significant improvements over conventional cepstral mean normalization (CMN) in a reverberant environment. In this paper, we propose a blind dereverberation method based on generalized spectral subtraction (GSS), which has been shown to be effective for noise reduction, instead of power SS. Furthermore, we extend missing feature theory (MFT), which was initially proposed to enhance robustness against additive noise, to dereverberation. A one-stage dereverberation and denoising method based on GSS is presented to simultaneously suppress both additive noise and nonstationary multiplicative noise (reverberation). The proposed dereverberation method based on GSS with MFT is evaluated on a large vocabulary continuous speech recognition task. In the absence of additive noise, the dereverberation method based on GSS with MFT using only 2 microphones achieved relative word error reduction rates of 11.4% and 32.6% compared to the dereverberation method based on power SS and conventional CMN, respectively. For reverberant and noisy speech, the dereverberation and denoising method based on GSS achieved a relative word error reduction rate of 12.8% compared to conventional CMN with GSS-based additive noise reduction. We also analyze the factors that affect compensation parameter estimation for the SS-based dereverberation method, such as the number of channels (the number of microphones), the length of reverberation to be suppressed, and the length of the utterance used for parameter estimation.
The experimental results showed that the SS-based method is robust in a variety of reverberant environments for both isolated and continuous speech recognition and under various parameter estimation conditions.
In a distant-talking environment, channel distortion drastically degrades speech recognition performance because of a mismatch between the training and testing environments. Current approaches to making automatic speech recognition (ASR) robust to reverberation and noise can be classified into speech signal processing, robust feature extraction, and model adaptation [1–3].
In this paper, we focus on speech signal processing in the distant-talking environment. Because both the speech signal and the reverberation are nonstationary, recovering clean speech from the convolution of a nonstationary speech signal with an impulse response is very difficult. Several studies have addressed this problem. A blind deconvolution-based approach for the restoration of speech degraded by the acoustic environment has been proposed. That scheme processed the outputs of two microphones using cepstral operations and the theory of signal reconstruction from phase only. Avendano et al. [5, 6] explored a speech dereverberation technique based on recovering the envelope modulations of the original (anechoic) speech. They applied a technique that they originally developed to treat background noise to the dereverberation problem. A novel approach for multimicrophone speech dereverberation has also been proposed. The method was based on the construction of the null subspace of the data matrix in the presence of colored noise, employing generalized singular-value decomposition or generalized eigenvalue decomposition of the respective correlation matrices. A reverberation compensation method for speaker recognition using SS, in which late reverberation is treated as additive noise, was proposed in [9, 10]. However, the drawback of this approach is that the optimum parameters for SS are empirically estimated from a development dataset, and the late reverberation cannot be subtracted correctly because it is not modeled precisely.
Power SS is the most commonly used SS method. A previous study has shown that GSS with a lower exponent parameter is more effective than power SS for noise reduction. In this paper, GSS is employed instead of power SS to suppress late reverberation. We also investigate the use of missing feature theory (MFT), which enhances robustness to noise, in combination with GSS, since the reverberation cannot be suppressed completely owing to the estimation error of the impulse response. Soft-mask estimation-based MFT calculates the reliability of each spectral component from the signal-to-noise ratio (SNR). We apply this idea to reverberant speech. However, reliability estimation is complicated in a distant-talking environment. In earlier work, reliability was estimated from the time lag between the power spectrum of the clean speech and that of the distorted speech. In this paper, reliability is estimated from the signal-to-reverberation ratio (SRR), since the power spectra of the clean speech and the reverberation signal can be estimated by power SS or GSS using the multi-channel LMS (MCLMS) algorithm. A diagram of the modified proposed method combining GSS with MFT is shown in Figure 1b.
In this paper, we also investigate the robustness of SS-based dereverberation under various reverberant conditions for large vocabulary continuous speech recognition (LVCSR). We analyze the factors (numbers of reverberation windows and channels, length of utterance, and distance between sound source and microphone) that affect compensation parameter estimation for dereverberation based on SS.
The remainder of this paper is organized as follows: Section 2 outlines blind dereverberation based on SS. MFT for dereverberation is described in Section 3. A one-stage dereverberation and denoising method is proposed in Section 4, while Section 5 describes the experimental results of distant speech recognition in a reverberant environment. Finally, Section 6 summarizes the paper.
2. Outline of blind dereverberation
2.1 Dereverberation based on power SS
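In Equation (1), the observed distant-talking speech x[t] is modeled as the clean speech s[t] convolved with the impulse response h[t], plus additive noise, where * denotes the convolution operation. In this paper, additive noise is ignored for simplification, so Equation (1) becomes x[t] = h[t] * s[t].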
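As a minimal sketch of the noise-free model x[t] = h[t] * s[t], the following NumPy snippet convolves a toy source with a toy impulse response. The function name and the sample values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def simulate_distant_speech(s, h, noise=None):
    """Return x[t] = (h * s)[t], truncated to len(s), optionally plus noise."""
    x = np.convolve(s, h)[: len(s)]
    if noise is not None:
        x = x + np.asarray(noise)[: len(s)]
    return x

# Toy signals (hypothetical values, chosen only to make the result easy to follow).
s = np.array([1.0, 0.0, 0.0, 0.5])   # "clean speech"
h = np.array([1.0, 0.6, 0.3])        # "impulse response"
x = simulate_distant_speech(s, h)
```

Each output sample is a weighted sum of the current and earlier source samples, which is why the distortion cannot be removed by a single per-sample gain.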
where f is the frame index, H(ω) is the STFT of the impulse response, S(f, ω) is the STFT of the clean speech s, D is the number of reverberation windows, and H(d, ω) denotes the part of H(ω) corresponding to frame delay d. That is, with a long impulse response, the channel distortion is no longer multiplicative in the linear spectral domain but convolutional.
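The definitions above are consistent with a frame-wise convolutive approximation of the reverberant spectrum. The form below is a sketch written from those definitions rather than a quotation of the paper's numbered equation; the symbol X(f, ω) for the STFT of the observed speech is an assumption.

```latex
X(f, \omega) \approx \sum_{d=0}^{D-1} S(f-d, \omega)\, H(d, \omega)
```

With D = 1 the sum collapses to the multiplicative model X(f, ω) = S(f, ω) H(0, ω), which is the case a per-utterance normalization such as CMN can handle.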
where H(d, ω), d = 0, 1, ..., D-1, is the STFT of the impulse response, which can be calculated from a known impulse response or blindly estimated.
where X(ω) is the spectrum of the input speech x[t].
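The recursion behind power SS dereverberation can be sketched as follows: the late reverberation in frame f is predicted from the clean power spectra already estimated for earlier frames, weighted by |H(d, ω)|^2, and subtracted from the observed power spectrum. This is a minimal sketch under assumptions: the exact placement of the over-subtraction factor α and the spectral floor β in the paper's equation is not recoverable from this text, and the default values below are placeholders.

```python
import numpy as np

def power_ss_dereverb(X_pow, H_pow, alpha=1.0, beta=0.0):
    """Estimate the clean power spectrogram by subtracting late reverberation.

    X_pow: (frames, bins) power spectrogram |X(f, w)|^2 of reverberant speech.
    H_pow: (D, bins) power spectra |H(d, w)|^2 for frame delays d = 0..D-1.
    alpha, beta: over-subtraction factor and spectral floor (placeholder values).
    """
    n_frames, n_bins = X_pow.shape
    D = H_pow.shape[0]
    S_pow = np.zeros_like(X_pow, dtype=float)
    for f in range(n_frames):
        # Late reverberation predicted from already-estimated earlier frames.
        late = np.zeros(n_bins)
        for d in range(1, min(D, f + 1)):
            late += S_pow[f - d] * H_pow[d]
        # Subtract, apply the spectral floor, then undo the direct-path gain.
        S_pow[f] = np.maximum(X_pow[f] - alpha * late, beta * X_pow[f]) / H_pow[0]
    return S_pow

# Toy check: build X exactly from the power-domain model, then invert it.
rng = np.random.default_rng(0)
S_true = rng.uniform(0.5, 1.5, size=(6, 3))
H_pow = np.array([[1.0] * 3, [0.5] * 3, [0.25] * 3])
X_pow = np.zeros_like(S_true)
for f in range(S_true.shape[0]):
    for d in range(min(H_pow.shape[0], f + 1)):
        X_pow[f] += S_true[f - d] * H_pow[d]
S_est = power_ss_dereverb(X_pow, H_pow)
```

In this idealized setting the recursion recovers S_true exactly; with estimated impulse responses the subtraction is only approximate, which is where the floor β matters.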
2.2 Dereverberation based on GSS
where n is the exponent parameter. For power SS, the exponent parameter n is equal to 1. In this paper, n is set to 0.1, as this value yielded the best results in a previous study.
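A hedged sketch of how the exponent parameter n generalizes the subtraction: every spectrum is raised to the power 2n before subtraction and mapped back afterwards, so n = 1 reproduces power SS. The function name, the argument layout, and the default α and β are assumptions made for illustration.

```python
import numpy as np

def gss_dereverb(X_mag, H_mag, n=0.1, alpha=1.0, beta=0.0):
    """Generalized spectral subtraction in the |.|^(2n) domain.

    X_mag: (frames, bins) magnitude spectrogram |X(f, w)|.
    H_mag: (D, bins) magnitudes |H(d, w)| for frame delays d = 0..D-1.
    Returns the estimated clean magnitude spectrogram.
    """
    p = 2.0 * n
    Xp, Hp = X_mag ** p, H_mag ** p
    Sp = np.zeros_like(Xp, dtype=float)
    D = Hp.shape[0]
    for f in range(Xp.shape[0]):
        late = np.zeros(Xp.shape[1])
        for d in range(1, min(D, f + 1)):
            late += Sp[f - d] * Hp[d]
        Sp[f] = np.maximum(Xp[f] - alpha * late, beta * Xp[f]) / Hp[0]
    # Map back from the |.|^(2n) domain to magnitudes.
    return Sp ** (1.0 / p)

# With a single window (D = 1) there is nothing to subtract, so both
# settings reduce to dividing by the direct-path magnitude.
X_mag = np.array([[2.0, 4.0], [6.0, 8.0]])
H_mag = np.array([[2.0, 2.0]])
S_gss = gss_dereverb(X_mag, H_mag, n=0.1)
S_pss = gss_dereverb(X_mag, H_mag, n=1.0)
```

The two settings differ only when D > 1, where the smaller exponent compresses the dynamic range of the terms being subtracted.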
2.3 Compensation parameter estimation for SS by multi-channel LMS algorithm
An adaptive multi-channel LMS algorithm for blind single-input multiple-output (SIMO) system identification has been proposed previously.
where h_n(t, l) is the l-th tap of the n-th impulse response at time t. If the SIMO system is blindly identifiable, the matrix R_x+ is rank deficient by 1 (in the absence of noise) and the channel impulse responses can be uniquely determined.
where the model filter is the estimate at time t. A tilde is placed on the instantaneous value to distinguish it from its mathematical expectation.
By minimizing the cost function J of Equation (21), the impulse responses can be blindly derived. Wang et al. extended the VSS-UMCLMS algorithm, which identifies the multi-channel impulse responses, to processing in the frequency domain in combination with SS.
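The rank-deficiency property that makes blind identification possible can be illustrated with a two-channel toy example. Because x_1 = s * h_1 and x_2 = s * h_2, the cross relation x_2 * h_1 = x_1 * h_2 holds, so the stacked filter vector lies in the null space of a data matrix. The sketch below finds that null vector with a batch SVD; this is a swapped-in stand-in for illustration, not the adaptive MCLMS or VSS-UMCLMS update, and all names and values are assumptions.

```python
import numpy as np

def convolution_matrix(x, L):
    """Rows are [x[t], x[t-1], ..., x[t-L+1]] for t = L-1 .. len(x)-1."""
    return np.array([x[t - L + 1 : t + 1][::-1] for t in range(L - 1, len(x))])

def cross_relation_identify(x1, x2, L):
    """Estimate two L-tap channels (up to a common scale) from their outputs."""
    X1 = convolution_matrix(x1, L)
    X2 = convolution_matrix(x2, L)
    # A @ [h1; h2] = x2 * h1 - x1 * h2, which is zero for the true channels.
    A = np.hstack([X2, -X1])
    _, sing, Vt = np.linalg.svd(A, full_matrices=False)
    u = Vt[-1]  # right singular vector of the smallest singular value
    return u[:L], u[L:], sing

# Toy SIMO system: one unknown source, two short channels without common zeros.
rng = np.random.default_rng(0)
s = rng.standard_normal(200)
h1 = np.array([1.0, 0.5, -0.2])
h2 = np.array([0.8, -0.3, 0.4])
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)
h1_est, h2_est, sing = cross_relation_identify(x1, x2, L=3)

# Compare directions, since blind identification leaves the scale undetermined.
true_dir = np.concatenate([h1, h2])
true_dir /= np.linalg.norm(true_dir)
alignment = abs(np.dot(np.concatenate([h1_est, h2_est]), true_dir))
```

Exactly one singular value is numerically zero in this noise-free case, which mirrors the "rank deficient by 1" condition; the adaptive algorithms reach the same solution iteratively instead of through a decomposition.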
3. Missing feature theory for dereverberation
MFT enhances the robustness of speech recognition to noise by rejecting unreliable acoustic features using a missing feature mask (MFM). The MFM gives the reliability of each spectral component, with 0 and 1 denoting unreliable and reliable, respectively. The MFM is typically either a hard mask or a soft mask. The hard mask applies binary reliability values of 0 or 1 to each spectral component and is generated using the signal-to-noise ratio (SNR). The reliability is 1 when the SNR is greater than a manually defined threshold; otherwise it is 0. The soft mask is considered a better approach than the hard mask and applies a continuous value between 0 and 1 using a sigmoid function.
where a and b are the gradient and center of the sigmoid function, respectively, and are empirically determined. Finally, the estimated spectrum of clean speech from Equation (10) is multiplied by the reliability r(f, ω), and the inverse DFT of the product forms the dereverberant speech.
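A soft mask of this kind can be sketched as a sigmoid of the signal-to-reverberation ratio. The snippet assumes the SRR is expressed in dB and computed from estimated clean and reverberation power spectra; that definition, the function name, and the values of a and b used here are illustrative assumptions.

```python
import numpy as np

def soft_mask(S_pow, R_pow, a, b, eps=1e-12):
    """Reliability r in (0, 1) from a sigmoid of the SRR in dB.

    S_pow: estimated clean-speech power spectrum.
    R_pow: estimated reverberation power spectrum.
    a, b:  gradient and center of the sigmoid (empirically chosen).
    """
    srr_db = 10.0 * np.log10((S_pow + eps) / (R_pow + eps))
    return 1.0 / (1.0 + np.exp(-a * (srr_db - b)))

# Three components at +10 dB, 0 dB, and -10 dB SRR (hypothetical values).
S_pow = np.array([10.0, 1.0, 0.1])
R_pow = np.array([1.0, 1.0, 1.0])
r = soft_mask(S_pow, R_pow, a=0.5, b=0.0)
masked = r * S_pow  # reliability-weighted spectrum passed on to the inverse DFT
```

A component whose SRR equals the center b receives reliability 0.5; components dominated by speech approach 1 and those dominated by reverberation approach 0.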
4. One-stage dereverberation and denoising based on GSS
where X_N(f, ω) is the spectrum obtained by subtracting the noise spectrum from the spectrum of the observed speech, and the remaining symbol denotes the mean vector of X_N(f, ω).
5.1 Experimental setup
Details of recording conditions for impulse response measurement
(a) RWCP database
Echo room (panel)
Echo room (cylinder)
Tatami-floored room (S)
Tatami-floored room (S)
Tatami-floored room (L)
Tatami-floored room (L)
Echo room (panel)
(b) CENSREC-4 database
9.0 × 6.0 m
Japanese style room
3.5 × 2.5 m
11.5 × 27.0 m
Japanese style bath
1.5 × 1.0 m
7.0 × 3.0 m
7.0 × 8.5 m
11.5 × 6.5 m
Conditions for speech recognition
5 states, 3 output probability left-to-right triphone HMMs
25 dimensions with CMN (12MFCCs + Δ + Δpower)
Conditions for SS-based dereverberation
Number of reverberant windows D
Noise overestimation factor α
1.0 (Power SS)
Spectral floor parameter β
Soft-mask gradient parameter a
0.05 (Power SS)
Soft-mask center parameter b
Channel numbers (corresponding to Figure 3a) used for dereverberation and denoising (RWCP database)
17, 21, 25, 29
1, 5, 9, 13
17, 19, 21, 23,
1, 3, 5, 7, 9,
25, 27, 29, 30
11, 13, 15, 17
5.2 Effect factor analysis of compensation parameter estimation
Unless otherwise noted, four microphones (note b) were used in this section to estimate the spectra of the impulse responses. Delay-and-sum beamforming (BF) was performed on the 4-channel dereverberant speech signals. For the proposed method, each speech channel was compensated by the corresponding estimated impulse response. Preliminary experimental results for isolated word recognition showed that the SS-based dereverberation method significantly improved speech recognition performance compared with traditional CMN with beamforming.
In this section, we also analyze the factors (number of reverberation windows D in Equation (9), number of channels, and length of utterance) that affect compensation parameter estimation for the SS-based dereverberation method, using the RWCP database.
Detailed results for different numbers of reverberation windows D and reverberant environments (%)
Array number #
Number of reverberation windows D
5.3 Experimental results of dereverberation and denoising
In this section, reverberation and noise suppression using only 2 speech channels (note c) is described.
In both the SS-based and GSS-based dereverberation methods, speech signals from two microphones were used to blindly estimate the compensation parameters for power SS and GSS (that is, the spectra of the channel impulse responses). Reverberation was then suppressed by SS, and the spectrum of the dereverberant speech was transformed back to the time domain. Finally, delay-and-sum beamforming was performed on the two-channel dereverberant speech. The schematic of dereverberation is shown in Figure 1.
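The final delay-and-sum step can be sketched as follows. The snippet assumes the inter-channel delays are already known as integer sample counts; how the delays are obtained is not specified here, and the function name and toy signals are illustrative.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its arrival delay (in samples) and average.

    channels: list of 1-D arrays, one per microphone.
    delays:   non-negative integer delay of each channel.
    """
    n = min(len(c) - d for c, d in zip(channels, delays))
    aligned = [c[d : d + n] for c, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

# Toy example: the second microphone receives the same signal 2 samples later.
s = np.arange(1.0, 9.0)
ch0 = s
ch1 = np.concatenate([np.zeros(2), s])
y = delay_and_sum([ch0, ch1], delays=[0, 2])
```

After alignment the speech adds coherently while uncorrelated residual errors in the individual channels are averaged down.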
Word accuracy for LVCSR (%)
Distorted speech #
Breakdown of speech recognition errors (%)
Word accuracy for one-stage dereverberation and denoising (%)
CMN with GSS-based noise reduction
One-stage dereverberation and denoising based on GSS
Average values of the estimated spectra of impulse responses from noise-free and additive noise conditions and their difference
Previously, Wang et al. proposed a blind dereverberation method based on power SS employing the multi-channel LMS algorithm for distant-talking speech recognition. Previous studies showed that GSS with an arbitrary exponent parameter is more effective than power SS for noise reduction. In this paper, GSS was applied instead of power SS to suppress late reverberation. However, reverberation cannot be completely suppressed owing to the estimation error of the impulse response, so MFT was used to enhance robustness to noise. Soft-mask estimation-based MFT calculates the reliability of each spectral component from the SNR. In this paper, reliability was estimated from the signal-to-reverberation ratio. Furthermore, delay-and-sum beamforming was applied to the multi-channel speech compensated by the reverberation compensation method. Our SS-based and GSS-based dereverberation methods were evaluated using distorted speech signals simulated by convolving multi-channel impulse responses with clean speech. In the absence of additive noise, the GSS-based method without MFT achieved an average relative word error reduction rate of 31.4% compared to conventional CMN and 9.8% compared to the power SS-based method without MFT. Combining MFT with either method yielded further improvement. The GSS-based method with MFT achieved average relative word error reduction rates of 32.6% and 11.4% compared to conventional CMN and the originally proposed power SS-based method, respectively. The one-stage dereverberation and denoising method based on GSS achieved a relative word error reduction rate of 12.8% compared to conventional CMN with GSS-based additive noise reduction.
In this paper, we also investigated the factors (numbers of reverberation windows and channels, and length of utterance) that affect compensation parameter estimation. We reached the following conclusions: (1) speech recognition performance did not vary greatly when the number of reverberation windows was between 4 and 10 and was significantly better than the baseline, (2) the compensation parameter estimation was robust to the number of channels, and (3) speech recognition performance did not degrade when the utterance used for parameter estimation was longer than 1 s.
a For example, to estimate the clean power spectrum of the 2i-th window W_2i, the estimated clean power spectra of the 2(i-1)-th window W_2(i-1), the 2(i-2)-th window W_2(i-2), ... were used.
b For the RWCP database, the 4 speech channels shown in Table 4 were used. For the CENSREC-4 database, speech channels 1, 3, 5, and 7 shown in Figure 3b were used.
c For the RWCP database, the 2 speech channels shown in Table 4 were used. For the CENSREC-4 database, speech channels 1 and 3 shown in Figure 3b were used.
- Huang Y, Benesty J, Chen J: Acoustic MIMO Signal Processing. Springer-Verlag, Berlin; 2006.
- Maganti H, Matassoni M: An auditory based modulation spectral feature for reverberant speech recognition. In Proceedings of INTERSPEECH 2010. Makuhari, Japan; 2010:570-573.
- Raut C, Nishimoto T, Sagayama S: Adaptation for long convolutional distortion by maximum likelihood based state filtering approach. Proc ICASSP 2006, 1: 1133-1136.
- Subramaniam S, Petropulu AP, Wendt C: Cepstrum-based deconvolution for speech dereverberation. IEEE Trans Speech Audio Process 1996, 4(5):392-396. 10.1109/89.536934
- Avendano C, Hermansky H: Study on the dereverberation of speech based on temporal envelope filtering. In Proceedings of ICSLP-1996. Philadelphia, USA; 1996:889-892.
- Avendano C, Tibrewala S, Hermansky H: Multiresolution channel normalization for ASR in reverberation environments. In Proceedings of EUROSPEECH-1997. Rhodes, Greece; 1997:1107-1110.
- Hermansky H, Wan EA, Avendano C: Speech enhancement based on temporal processing. In Proceedings of ICASSP-1995. Seattle WA, USA; 1995:405-408.
- Gannot S, Moonen M: Subspace methods for multimicrophone speech dereverberation. EURASIP J Appl Signal Process 2003, 2003(1):1074-1090. 10.1155/S1110865703305049
- Jin Q, Pan Y, Schultz T: Far-field speaker recognition. Proc ICASSP 2006, 1: 937-940.
- Jin Q, Schultz T, Waibel A: Far-field speaker recognition. IEEE Trans ASLP 2007, 15(7):2023-2032.
- Huang Y, Benesty J: Adaptive blind channel identification: multi-channel least mean square and Newton algorithms. ICASSP II 2002, 1637-1640.
- Huang Y, Benesty J: Adaptive multi-channel least mean square and Newton algorithms for blind channel identification. Signal Process 2002, 82: 1127-1138. 10.1016/S0165-1684(02)00247-5
- Huang Y, Benesty J, Chen J: Optimal step size of the adaptive multi-channel LMS algorithm for blind SIMO identification. IEEE Signal Process Lett 2005, 12(3):173-175.
- Wang L, Kitaoka N, Nakagawa S: Distant-talking speech recognition based on spectral subtraction by multi-channel LMS algorithm. IEICE Trans Inf Syst 2011, E94-D(3):659-667. 10.1587/transinf.E94.D.659
- Sim BL, Tong YC, Chang JS, Tan CT: A parametric formulation of the generalized spectral subtraction method. IEEE Trans Speech Audio Process 1998, 6(4):328-337. 10.1109/89.701361
- Raj B, Stern RM: Missing-feature approaches in speech recognition. IEEE Signal Process Mag 2005, 22(9):101-116.
- Palomaki KJ, Brown GJ, Barker J: Missing data speech recognition in reverberant conditions. In Proceedings of ICASSP-2002. Orlando, FL; 2002:65-68.
- Makino S, Niyada K, Mafune Y, Kido K: Tohoku University and Panasonic isolated spoken word database. J Acoust Soc Jpn 1992, 48(12):899-905. (in Japanese)
- Nishiura T, Gruhn R, Nakamura S: Evaluation framework for distant-talking speech recognition under reverberant environments. In Proceedings of INTERSPEECH-2008. Brisbane, Australia; 2008:968-971.
- Itou K, Yamamoto M, Takeda K, Takezawa T, Matsuoka T, Kobayashi T, Shikano K, Itahashi S: JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research. J Acoust Soc Jpn (E) 1999, 20(3):199-206. 10.1250/ast.20.199
- Lee A, Kawahara T, Shikano K: Julius--an open source real-time large vocabulary recognition engine. In Proceedings of European Conference on Speech Communication and Technology. Aalborg, Denmark; 2001:1691-1694.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.