Combination of MVDR beamforming and single-channel spectral processing for enhancing noisy and reverberant speech
EURASIP Journal on Advances in Signal Processing volume 2015, Article number: 61 (2015)
Abstract
This paper presents a system aiming at joint dereverberation and noise reduction by applying a combination of a beamformer with a single-channel spectral enhancement scheme. First, a minimum variance distortionless response beamformer with an online estimated noise coherence matrix is used to suppress noise and reverberation. The output of this beamformer is then processed by a single-channel spectral enhancement scheme, based on statistical room acoustics, minimum statistics, and temporal cepstrum smoothing, to suppress residual noise and reverberation. The evaluation is conducted using the REVERB challenge corpus, designed to evaluate speech enhancement algorithms in the presence of both reverberation and noise. The proposed system is evaluated using instrumental speech quality measures, the performance of an automatic speech recognition system, and a subjective evaluation of the speech quality based on a MUSHRA test. The performance achieved by beamforming, single-channel spectral enhancement, and their combination are compared, and experimental results show that the proposed system is effective in suppressing both reverberation and noise while improving the speech quality. The achieved improvements are particularly significant in conditions with high reverberation times.
1 Introduction
In many speech communication applications, such as voice-controlled systems or hearing aids, distant microphones are used to record a target speaker. The microphone signals are often corrupted by both reverberation and noise, resulting in a degraded speech quality and speech intelligibility, as well as in a reduced performance of automatic speech recognition (ASR) systems.
Several algorithms have been proposed in the literature to deal with these issues (cf. [1–3] and the references therein). This paper extends the description and evaluation of the system proposed by the authors in [4], which consists of a commonly used combination of a minimum variance distortionless response (MVDR) beamformer with a single-channel spectral enhancement scheme. In such a combined system, the spectral enhancement scheme typically consists of applying a real-valued spectral gain to the short-time Fourier transform (STFT) of the beamformer output. The computation of this spectral gain relies on estimates of the power spectral densities (PSDs) of the interference to be suppressed, i.e., noise and late reverberation, since early reflections are often considered to be beneficial both in terms of speech quality [5] and ASR performance [6].
Different methods have been proposed for estimating the late reverberant and noise PSDs, e.g. relying on assumptions about the sound field or on a voice activity detector (VAD). The PSDs of the noise and reverberation can be estimated using the output signal(s) of a blocking matrix, suppressing the signal to be preserved, in the well-known generalized sidelobe canceller (GSC) structure. The blocking matrix can be designed, e.g., as a delay-and-subtract beamformer cancelling the direct speech component [7, 8] or based on a blind source separation (BSS) scheme aiming to cancel both the direct speech component and the early reflections [9, 10]. Alternatively, the PSD at a reference position can be obtained using a maximum likelihood estimator (MLE) and a model of the sound field [11]. The PSD to be used in the computation of the spectral postfilter is then obtained by correcting the estimated PSD at the reference position. This correction can be done using an adaptive filter [8], back-projection [9, 10], or the relative transfer functions between the target speaker and the microphones [11].
Other methods estimate the PSD of the interference from the output of the beamformer and thus can in principle also be used if only one microphone is available. In such methods [4, 12], the estimation of the noise PSD is often derived from statistical models of the speech and noise [13, 14]. The estimation of the reverberant PSD can, e.g., be derived from a statistical model of the room impulse response (RIR) and the acoustical properties of the room, such as the reverberation time (T 60) or the direct-to-reverberant ratio (DRR) [15, 16].
In the system presented in this paper, the microphone signals are first processed using an MVDR beamformer [17], which aims to suppress sound sources not arriving from the direction of arrival (DOA) of the target speaker, while maintaining a unit gain towards this DOA. The noise coherence matrix used to compute the coefficients of the MVDR beamformer is estimated online using a VAD [18], and the DOA of the target speaker is estimated using the multiple signal classification (MUSIC) algorithm [19, 20]. The beamformer output is processed using a single-channel spectral enhancement scheme, which aims at jointly suppressing the residual noise and reverberation. The main novel contribution of this paper is the combination of the several estimators used in the single-channel spectral enhancement scheme. This spectral enhancement scheme relies on estimates of the PSDs of the noise and the late reverberation, similarly as in [21]. The proposed scheme computes a real-valued spectral gain, combining the clean speech amplitude estimator presented in [22], the noise PSD estimator based on minimum statistics (MS) [13], and an estimator of the (late) reverberant PSD based on statistical room acoustics [15, 23]. In order to reduce the musical noise which is often a byproduct of spectral enhancement schemes, adaptive smoothing in the cepstral domain is used to estimate the speech PSD [24, 25].
The proposed system is evaluated using the REVERB challenge corpus [26], which permits the evaluation of algorithms under realistic conditions in single- and multi-channel scenarios. The single-channel scenario is particularly challenging, as illustrated by the results of the REVERB challenge workshop [27], in which most contributions succeeded in reducing reverberation but only a few improved the speech quality [4, 12]. The evaluation is conducted for different configurations of the proposed system in terms of instrumental speech quality measures, improvement of ASR performance, and a subjective evaluation of speech quality and dereverberation using a MUSHRA test [28]. The evaluation results show that the proposed system is able to reduce noise and reverberation while improving the speech quality in both single- and multi-channel scenarios.
This paper is organized as follows. In Section 2, an overview of the proposed system is given. Details about the proposed MVDR beamformer and the single-channel spectral enhancement scheme are presented in Section 3 and in Section 4, respectively. The evaluation corpus is briefly described in Section 5 and the evaluation results are presented in Section 6.
2 System overview
When recording a single speech source in an enclosure using M microphones, the reverberant and noisy mth microphone signal y m (n) at time index n is given by
\begin{align} y_{m}(n) &= \sum_{n'} h_{m}(n')\, s(n-n') + v_{m}(n) \tag{1} \\ &= x_{m}(n) + v_{m}(n), \tag{2} \end{align}
with s(n) denoting the clean speech signal, h m (n) denoting the RIR between the speech source and the mth microphone, and x m (n) and v m (n) denoting the reverberant speech component and the additive noise component in the mth microphone signal, respectively. The STFT representations of y m (n), s(n), x m (n), and v m (n) are denoted by Y m (k,ℓ), S(k,ℓ), X m (k,ℓ), and V m (k,ℓ), respectively, with k and ℓ representing the discrete frequency bin and frame indices, respectively.
The proposed system, depicted in Fig. 1, aims at obtaining an estimate \(\hat {s}(n)\), with \(\hat {\cdot }\) denoting estimated quantities, of the clean speech signal s(n) from the reverberant and noisy microphone signals, y m (n). This system consists of two stages. First, an MVDR beamformer is applied to the microphone signals. This beamformer aims at reducing noise and reverberation by suppressing the sound sources not arriving from the target DOA, while providing a unity gain in the direction of the target speaker. The noise coherence matrix and the DOA used to compute the MVDR beamformer coefficients are estimated from the received microphone signals y m (n). The noise coherence matrix is estimated using a VAD [18], whereas the DOA estimation is based on the MUSIC algorithm [19, 20], cf. Section 3. In order to suppress the residual noise and reverberation at the beamformer output \(\tilde {x}(n)\), the beamformer output is processed by a single-channel spectral enhancement scheme, cf. Section 4.
3 Beamformer
3.1 MVDR beamforming
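In the STFT domain, (2) can be expressed as
\[ Y_{m}(k,\ell) = X_{m}(k,\ell) + V_{m}(k,\ell), \tag{3} \]
which in vector notation can be written as
\[ \mathbf{Y}(k,\ell) = \mathbf{X}(k,\ell) + \mathbf{V}(k,\ell), \tag{4} \]
with
\[ \mathbf{Y}(k,\ell) = \left[ Y_{1}(k,\ell) \; Y_{2}(k,\ell) \; \ldots \; Y_{M}(k,\ell) \right]^{T} \tag{5} \]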
denoting the M-dimensional stacked vector of the received microphone signals and X(k,ℓ) and V(k,ℓ) denoting the stacked vectors of the reverberant speech component and noise component, respectively, defined in the same way as in (5).
In the STFT domain, the beamformer output signal \(\tilde {x}(n)\) is denoted by \(\tilde {X}(k,\ell)\) and obtained by filtering and summing the microphone signals, i.e.,
\[ \tilde{X}(k,\ell) = \mathbf{W}_{\theta}^{H}(k)\, \mathbf{Y}(k,\ell), \tag{6} \]
with W θ (k) denoting the stacked filter coefficient vector of the beamformer steered towards the angle θ.
Aiming at minimizing the noise power while providing a unity gain in the direction of the target speaker, the filter coefficients of the MVDR beamformer are computed as [17]
\[ \mathbf{W}_{\theta}(k) = \frac{\mathbf{\Gamma}^{-1}(k)\, \mathbf{d}_{\theta}(k)}{\mathbf{d}_{\theta}^{H}(k)\, \mathbf{\Gamma}^{-1}(k)\, \mathbf{d}_{\theta}(k)}, \tag{7} \]
where d θ (k) and Γ(k) denote the steering vector of the target speaker and the noise coherence matrix, respectively. Using a far-field assumption, the steering vector d θ (k) is equal to
\[ \mathbf{d}_{\theta}(k) = \left[ e^{-j 2\pi f_{k} \tau_{1}(\theta)} \; e^{-j 2\pi f_{k} \tau_{2}(\theta)} \; \ldots \; e^{-j 2\pi f_{k} \tau_{M}(\theta)} \right]^{T}, \tag{8} \]
with f k denoting the center frequency of frequency bin k and τ m (θ) denoting the time difference of arrival of the source at angle θ between the mth microphone and a reference position, which has been arbitrarily chosen as the center of the microphone array.
To compute the MVDR beamformer filter coefficients, an estimate \(\hat {\theta }\) of the DOA of the target speaker as well as an estimate of the noise coherence matrix is required.
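The MVDR solution described above can be illustrated for a single frequency bin with a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `mvdr_weights`, the random steering vector, and the random positive definite matrix standing in for the noise coherence matrix are all hypothetical choices made for illustration.

```python
import numpy as np

def mvdr_weights(gamma, d):
    """MVDR filter for one frequency bin: Gamma^{-1} d / (d^H Gamma^{-1} d)."""
    gamma_inv_d = np.linalg.solve(gamma, d)   # avoids forming the explicit inverse
    return gamma_inv_d / (d.conj() @ gamma_inv_d)

# Toy data (hypothetical): 4 microphones, unit-modulus steering vector,
# random Hermitian positive definite matrix in place of the noise coherence.
M = 4
rng = np.random.default_rng(0)
d = np.exp(-2j * np.pi * rng.uniform(0.0, 1.0, M))
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
gamma = A @ A.conj().T + M * np.eye(M)

w = mvdr_weights(gamma, d)
unit_gain = w.conj() @ d                      # response W^H d towards the target
white_case = mvdr_weights(np.eye(M), d)       # spatially white noise
```

In this sketch `unit_gain` equals 1 up to rounding, which is the distortionless constraint, and for an identity coherence matrix the filter reduces to the delay-and-sum weights `d / M`.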
3.2 Noise coherence matrix estimation
The noise coherence matrix is estimated during noise-only periods detected using the VAD described in [18], as the covariance matrix of the noise-only components, i.e.
with \(\mathbb {L}_{v}\) denoting the set of detected noise-only frames and \(\overline {\overline {\mathbb {L}}}_{v}\) its cardinality.
However, if the detected noise-only period is too short for a reliable estimate (cf. Section 5), the coherence matrix \(\overline {\mathbf {\Gamma }}(k)\) of a diffuse noise field is used instead, i.e., the coherence between two microphones i and i ′, separated by a distance \(l_{i,i^{\prime }}\), is computed as
\[ \overline{\Gamma}_{i,i^{\prime}}(k) = \frac{\sin\left(2\pi f_{k}\, l_{i,i^{\prime}}/c\right)}{2\pi f_{k}\, l_{i,i^{\prime}}/c}, \tag{10} \]
with c denoting the speed of sound, resulting in the well-known superdirective beamformer [17]. Additionally, a white noise gain constraint WNGmax is imposed in order to limit the potential amplification of uncorrelated noise, especially at low frequencies. With such a constraint, the used noise coherence matrix is equal to
\[ \mathbf{\Gamma}(k) = \overline{\mathbf{\Gamma}}(k) + \varrho(k)\, \mathbf{I}_{M}, \tag{11} \]
with I M denoting the M×M-dimensional identity matrix and ϱ(k) denoting a frequency-dependent regularization parameter which is computed iteratively such that \(\textbf {W}^{H}_{\theta }(k) \textbf {W}_{\theta }(k)\leq \text {WNG}_{\text {max}}\) [29].
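The fallback to a diffuse noise field with diagonal loading can be sketched as follows. The code assumes a spherically isotropic field, a speed of sound of 343 m/s, and a simple geometric search for the regularization parameter; the helper names, the starting value, the growth factor, and the bound of 10 on \(\textbf{W}^{H}\textbf{W}\) (passed in linear scale) are illustrative assumptions, not the iterative procedure of [29].

```python
import numpy as np

def diffuse_coherence(mic_pos, f, c=343.0):
    """Coherence of a spherically isotropic noise field for all microphone pairs."""
    dist = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    # np.sinc(x) = sin(pi x) / (pi x), i.e. sin(2 pi f l / c) / (2 pi f l / c) here
    return np.sinc(2.0 * f * dist / c)

def loaded_mvdr(gamma_bar, d, max_wHw, rho0=1e-6, growth=2.0, max_iter=60):
    """Increase the diagonal loading rho until W^H W <= max_wHw (linear scale)."""
    M = len(d)
    rho = rho0
    for _ in range(max_iter):
        g = np.linalg.solve(gamma_bar + rho * np.eye(M), d)
        w = g / (d.conj() @ g)                 # keeps W^H d = 1
        if np.real(w.conj() @ w) <= max_wHw:
            break
        rho *= growth
    return w, rho

# Circular array with 8 microphones and 20 cm diameter, one low frequency.
phi = 2.0 * np.pi * np.arange(8) / 8
mics = 0.1 * np.stack([np.cos(phi), np.sin(phi)], axis=1)
f = 300.0
gamma_bar = diffuse_coherence(mics, f)
tau = -(mics @ np.array([1.0, 0.0])) / 343.0   # far-field source at 0 degrees
d = np.exp(-2j * np.pi * f * tau)
w, rho = loaded_mvdr(gamma_bar, d, max_wHw=10.0)
```

At low frequencies the diffuse coherence matrix is close to a matrix of ones, so the unregularized superdirective filter has a very large norm; increasing `rho` moves the solution towards the delay-and-sum beamformer, whose squared norm is 1/M.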
3.3 DOA estimation
As the beamformer aims at suppressing sources not arriving from the target DOA, an error in the DOA estimate may lead to suppression of the desired source by the beamformer. In the proposed system, the subspace-based MUSIC algorithm [19, 20], which has been shown to be robust in our target application (cf. Section 6.1), has been used to compute the DOA estimate \(\hat {\theta }\).
Assuming that speech and noise are uncorrelated, the steering vector corresponding to the true DOA is orthogonal to the noise subspace, which is represented by an M×(M−Q)-dimensional matrix, with Q the number of sources (i.e., Q=1 in this case), defined as
The noise subspace E(k,ℓ) is composed of the eigenvectors of the covariance matrix of Y(k,ℓ) corresponding to the (M−Q) smallest eigenvalues.
The MUSIC algorithm then estimates the DOA as the angle maximizing the sum of the MUSIC pseudo-spectra
over a given frequency range, i.e.,
with K denoting the total number of considered frequency bins k=k low…k high.
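The broadband MUSIC search can be sketched in NumPy as below. This is an illustrative sketch under several assumptions: a planar circular array, a far-field source, a sign convention for the steering vector chosen here, and the usual pseudo-spectrum \(1/(\mathbf{d}^{H}\mathbf{E}\mathbf{E}^{H}\mathbf{d})\) summed without normalization. All function names are hypothetical, and the covariance matrices are synthesized rather than estimated from microphone signals.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def circular_array(M=8, diameter=0.2):
    phi = 2.0 * np.pi * np.arange(M) / M
    return 0.5 * diameter * np.stack([np.cos(phi), np.sin(phi)], axis=1)

def steering_vector(mic_pos, theta_deg, f):
    """Far-field steering vector with the time reference at the array centre."""
    theta = np.deg2rad(theta_deg)
    u = np.array([np.cos(theta), np.sin(theta)])   # unit vector towards the source
    tau = -(mic_pos @ u) / C                       # TDOA relative to the centre
    return np.exp(-2j * np.pi * f * tau)

def music_doa(cov, mic_pos, freqs, angles_deg, Q=1):
    """Return the candidate angle maximizing the sum of the MUSIC pseudo-spectra."""
    score = np.zeros(len(angles_deg))
    for R, f in zip(cov, freqs):
        _, vecs = np.linalg.eigh(R)                # eigenvalues in ascending order
        E = vecs[:, : R.shape[0] - Q]              # noise subspace (M - Q smallest)
        for i, a in enumerate(angles_deg):
            d = steering_vector(mic_pos, a, f)
            proj = np.linalg.norm(E.conj().T @ d) ** 2   # d^H E E^H d
            score[i] += 1.0 / (proj + 1e-12)
    return angles_deg[int(np.argmax(score))]

# Synthetic check: one source at 124 degrees plus spatially white noise.
mics = circular_array()
freqs = np.linspace(500.0, 4000.0, 8)
true_doa = 124
cov = np.stack([
    np.outer(steering_vector(mics, true_doa, f),
             steering_vector(mics, true_doa, f).conj()) + 0.1 * np.eye(8)
    for f in freqs
])
theta_hat = music_doa(cov, mics, freqs, np.arange(0, 360, 2))
```

With exact covariance matrices the steering vector of the true direction is orthogonal to the noise subspace in every bin, so the summed pseudo-spectrum peaks at the true angle on the 2° grid.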
4 Single-channel spectral enhancement
Although the beamformer in Section 3.1 is able to reduce the interference, i.e., noise and reverberation, to some extent, spectral enhancement schemes are able to further reduce reverberation as well as noise. The output signal \(\tilde {X}(k,\ell)\) of the MVDR beamformer contains the clean speech signal S(k,ℓ) as well as residual reverberation R(k,ℓ) and residual noise \({\tilde {V}}(k,\ell)\), i.e.
\[ \tilde{X}(k,\ell) = Z(k,\ell) + \tilde{V}(k,\ell), \tag{15} \]
with
\[ Z(k,\ell) = S(k,\ell) + R(k,\ell) \tag{16} \]
the reverberant speech component. Aiming at jointly reducing residual reverberation and noise, the single-channel spectral enhancement scheme summarized in Fig. 2 is proposed, where a real-valued spectral gain G(k,ℓ) is applied to the STFT coefficients of the beamformer output, i.e.,
\[ \hat{S}(k,\ell) = G(k,\ell)\, \tilde{X}(k,\ell), \tag{17} \]
with \(\hat {S}(k,\ell)\) denoting the STFT of the estimated speech signal.
The spectral gain G(k,ℓ) is computed using the minimum mean square error (MMSE) estimator for the clean speech spectral magnitude as proposed in [22] (cf. Section 4.1). This estimator, similarly to the Wiener filter, requires the PSDs of the clean speech, the noise, and the reverberation components.
First, an estimate \(\hat {\sigma }^{2}_{\tilde {v}}(k,\ell)\) of the noise PSD is obtained based on a slight modification of the well-known minimum statistics (MS) approach [13] (cf. Section 4.2) and used to estimate the reverberant speech PSD. The estimate \(\hat {\sigma }^{2}_{z}(k,\ell)\) of the reverberant speech PSD is computed using temporal cepstrum smoothing [24, 25] (cf. Section 4.3). The estimate \(\hat {\sigma }^{2}_{r}(k,\ell)\) of the (late) reverberant PSD is computed from the reverberant speech PSD estimate using the approach proposed in [15] (cf. Section 4.4). This approach requires an estimate of the reverberation time T 60, which has been obtained using the estimator described in [30]. As the dereverberation task is treated separately from the denoising task, care has to be taken that no reverberation leaks into the noise PSD estimate and vice versa. Thus, a longer minimum search window is used in the MS approach as compared to [13] (cf. Section 5.2).
The estimate \(\hat {\sigma }^{2}_{s}(k,\ell)\) of the clean speech PSD is finally obtained by a re-estimation, again using temporal cepstrum smoothing. The following subsections give a more detailed description of the different components of the proposed single-channel spectral enhancement scheme.
4.1 Spectral gain
The gain function used in the spectral enhancement scheme has been proposed in [22] to estimate the spectral magnitude of the clean speech. This estimator is derived by modeling the speech magnitude |S(k,ℓ)| as a stochastic variable with a chi probability density function (pdf) with shape parameter μ, while the phase of S(k,ℓ) is assumed to be uniformly distributed between −π and π. Furthermore, the interference \({J}(k,\ell) = {R}(k,\ell) + {\tilde {V}}(k,\ell)\) is modeled as a complex Gaussian random variable with PSD \({\sigma _{j}^{2}}(k,\ell) \). Assuming that R(k,ℓ) and \({\tilde {V}}(k,\ell)\) are uncorrelated, \({\sigma _{j}^{2}}(k,\ell)\) can be expressed as
\[ \sigma_{j}^{2}(k,\ell) = \sigma_{r}^{2}(k,\ell) + \sigma_{\tilde{v}}^{2}(k,\ell), \tag{18} \]
with \({\sigma _{r}^{2}}(k,\ell)\) and \(\sigma _{\tilde {v}}^{2}(k,\ell)\) denoting the PSDs of the reverberation and of the noise, respectively.
The squared distance between the amplitudes (to the power β) of the clean speech S(k,ℓ) and the estimated output \(\hat {{S}}(k,\ell)\) is defined as
The parameter β, typically chosen as 0<β≤1, is a compression factor resulting in a different emphasis given on estimation errors for small amplitudes in relation to large amplitudes. The clean speech magnitude is estimated by optimizing the MMSE criterion
with ξ(k,ℓ) denoting the a priori signal-to-interference ratio (SIR) defined as
\[ \xi(k,\ell) = \frac{\sigma_{s}^{2}(k,\ell)}{\sigma_{j}^{2}(k,\ell)}, \tag{21} \]
with \({\sigma _{s}^{2}}(k,\ell)\) denoting the PSD of the clean speech.
As shown in [22], the solution to (20) leads to the spectral gain \(\tilde {G}(k,\ell)\)
with γ(k,ℓ) denoting the a posteriori SIR, defined as
\[ \gamma(k,\ell) = \frac{|\tilde{X}(k,\ell)|^{2}}{\sigma_{j}^{2}(k,\ell)}, \tag{23} \]
and
with Φ(·) denoting the confluent hypergeometric function and Gam(·) denoting the complete Gamma function [31]. Depending on the choice of β and μ, the solution in (22) can resemble other well-known estimators, such as the short-time spectral amplitude estimator (β=1, μ=1) [32] or the log-spectral amplitude estimator (β=0, μ=1) [33]. In order to reduce artifacts which may be introduced by directly applying (22), the spectral gain G(k,ℓ) in (17) is restricted to values larger than a spectral floor G min (cf. Section 5.2), i.e.,
\[ G(k,\ell) = \max\left\{ \tilde{G}(k,\ell),\, G_{\text{min}} \right\}. \tag{25} \]
To compute the expression in (22), the PSDs \({\sigma _{s}^{2}}(k,\ell)\), \(\sigma _{\tilde {v}}^{2}(k,\ell)\), and \({\sigma _{r}^{2}}(k,\ell)\) have to be estimated from the beamformer output. The used estimators are described in the next subsections.
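How the three PSDs enter the gain computation can be illustrated with a simplified stand-in. The sketch below deliberately replaces the parameterized MMSE estimator of [22] by the Wiener gain \(\xi/(1+\xi)\), which needs the same a priori SIR, and then applies the spectral floor. It also assumes that G min is an amplitude gain, so that −10 dB is converted with a factor of 20; the function name is hypothetical.

```python
import numpy as np

def floored_gain(psd_s, psd_r, psd_v, g_min_db=-10.0):
    """Wiener-type gain with a spectral floor (stand-in for the estimator of [22])."""
    psd_j = psd_r + psd_v                      # interference PSD: reverberation + noise
    xi = psd_s / np.maximum(psd_j, 1e-12)      # a priori SIR
    gain = xi / (1.0 + xi)                     # Wiener gain, NOT the gain of [22]
    g_min = 10.0 ** (g_min_db / 20.0)          # amplitude floor in linear scale
    return np.maximum(gain, g_min)

# Three bins with a priori SIRs of 0, 1 and 100 (interference PSD equal to 1).
psd_s = np.array([0.0, 1.0, 100.0])
psd_r = np.full(3, 0.5)
psd_v = np.full(3, 0.5)
G = floored_gain(psd_s, psd_r, psd_v)
```

In the first bin the unconstrained gain would be zero, so the floor of about 0.316 takes over; the other two bins keep their Wiener gains of 0.5 and 100/101.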
4.2 Noise PSD estimator
The MS [13] approach has been shown to be a reliable estimator of the noise PSD for moderately time-varying noise conditions. This approach relies on the assumption that the minimum of the noisy speech power, \(P_{\tilde {x}}(k,\ell)\), over a short temporal sliding window is not affected by the speech. The noise PSD \(\sigma _{\tilde {v}}^{2}(k,\ell)\) is then estimated by tracking the minimum of \(P_{\tilde {x}}(k,\ell)\) over this sliding window, whose usual length corresponds to 1.5 s according to [13].
Figure 3 depicts the powers of anechoic speech, reverberant speech, and additive noise for one frequency bin of their power spectrograms. As illustrated in this figure, the decay time in speech pauses is typically increased in the presence of reverberation. Consequently, a longer tracking window is used in the proposed spectral enhancement scheme (cf. Section 5) in order to avoid reverberant speech affecting the estimation of the noise PSD \(\sigma _{\tilde {v}}^{2}(k,\ell)\).
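The effect of the tracking window length can be illustrated with a heavily simplified minimum tracker. The sketch below only smooths the periodogram recursively and takes a sliding minimum; it omits the bias compensation and the optimal time-varying smoothing of the MS approach [13]. The smoothing constant, the function name, and the synthetic signal (a constant noise floor with a single high-power burst standing in for a long speech segment with its reverberant tail) are assumptions made for illustration.

```python
import numpy as np

def min_tracking_noise_psd(power, win_frames, alpha=0.85):
    """Crude noise PSD estimate: sliding minimum of a recursively smoothed periodogram.

    power: array of shape (K, L) with |X(k, l)|^2 for K bins and L frames.
    """
    smoothed = np.empty_like(power, dtype=float)
    smoothed[:, 0] = power[:, 0]
    for l in range(1, power.shape[1]):
        smoothed[:, l] = alpha * smoothed[:, l - 1] + (1.0 - alpha) * power[:, l]
    noise = np.empty_like(smoothed)
    for l in range(power.shape[1]):
        start = max(0, l - win_frames + 1)     # window covers the last win_frames frames
        noise[:, l] = smoothed[:, start:l + 1].min(axis=1)
    return noise

# One frequency bin: noise power 1, burst of power 100 in frames 50..80.
n_frames = 300
power = np.ones((1, n_frames))
power[0, 50:81] += 100.0

noise_long = min_tracking_noise_psd(power, win_frames=94)   # roughly 1.5 s at 16 ms shift
noise_short = min_tracking_noise_psd(power, win_frames=10)  # shorter than the burst
```

With the long window the estimate stays at the noise floor, whereas the short window lies entirely inside the burst for some frames and the estimate rises to the burst power. The same mechanism motivates a longer window when reverberant decay fills the speech pauses.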
4.3 Speech PSD estimator
Temporal cepstrum smoothing, as proposed in [24], is used to estimate the PSD \({\sigma _{z}^{2}}(k,\ell)\) of the reverberant speech component Z(k,ℓ) as well as the PSD \({\sigma _{s}^{2}}(k,\ell)\) of the dereverberated speech signal S(k,ℓ). The estimation of \({\sigma _{z}^{2}}(k,\ell)\) only requires the noise PSD estimate \(\hat {\sigma }_{\tilde {v}}^{2}\left (k,\ell \right)\) whereas the estimation of \({\sigma _{s}^{2}}(k,\ell)\) additionally requires an estimate of the reverberant PSD \({\sigma _{r}^{2}}(k,\ell)\), as depicted in Fig. 2. The modifications required for the latter case are described at the end of this section.
In order to estimate the reverberant speech PSD \({\sigma _{z}^{2}}(k,\ell)\), the maximum likelihood (ML) estimator of the a priori signal to noise ratio (SNR)
\[ \xi_{z_{\text{ml}}}(k,\ell) = \frac{|\tilde{X}(k,\ell)|^{2}}{\hat{\sigma}^{2}_{\tilde{v}}(k,\ell)} - 1 \tag{26} \]
is employed. An estimate \(\hat {\sigma }^{2}_{z_{\text {ml}}}(k,\ell)\) of the reverberant speech PSD can then be obtained as
\[ \hat{\sigma}^{2}_{z_{\text{ml}}}(k,\ell) = \hat{\sigma}^{2}_{\tilde{v}}(k,\ell)\, \max\left\{ \xi_{z_{\text{ml}}}(k,\ell),\, \xi_{\text{ml}}^{\text{min}} \right\}, \tag{27} \]
with \(\xi _{\text {ml}}^{\text {min}}>0\) denoting a lower bound to avoid negative or very small values of \(\xi _{z_{\text {ml}}}(k,\ell)\).
In the cepstral domain, \(\hat {\sigma }^{2}_{z_{\text {ml}}}(k,\ell)\) can be represented by
with q denoting the cepstral bin index and L denoting the length of the FFT. A recursive temporal smoothing is applied to \(\lambda _{z_{\text {ml}}}(q,\ell)\), i.e.,
\[ \lambda_{z}(q,\ell) = \delta(q,\ell)\, \lambda_{z}(q,\ell-1) + \left(1-\delta(q,\ell)\right) \lambda_{z_{\text{ml}}}(q,\ell), \tag{29} \]
with δ(q,ℓ) denoting a time-quefrency-dependent smoothing parameter. Only a mild smoothing is applied to the quefrencies which are mainly related to speech, while for the remaining quefrencies, a stronger smoothing is applied. Consequently, a small smoothing parameter is chosen for the low quefrencies, as they contain information about the vocal tract shape, and for the quefrencies corresponding to the fundamental frequency f 0 in voiced speech. In order to protect these quefrencies, especially the ones corresponding to the fundamental frequency, the parameter δ(q,ℓ) in (29) is adapted. After determining f 0 by picking the highest peak in the cepstrum within a limited search range, δ(q,ℓ) is defined as
with \(\mathbb {Q}\) denoting a small set of cepstral bins around the quefrency corresponding to f 0 and δ pitch the smoothing parameter for the quefrency bins within \(\mathbb {Q}\) [24]. The quantity \(\bar {\delta }(q,\ell)\) is given as
where \(\bar {\delta }_{\text {const}}(q)\) is time independent and chosen such that less smoothing is applied in the lower cepstral bins. Furthermore, η is a forgetting factor which defines how fast the transition from δ(q,ℓ) to \(\bar {\delta }_{\text {const}}(q)\) can occur (cf. Section 5.2). Finally, the reverberant speech PSD estimate \(\hat {\sigma }_{z}^{2}\left (k,\ell \right)\) can be obtained by transforming λ z (q,ℓ) back to the spectral domain, i.e.
with κ denoting a parameter that compensates for the bias due to the recursive smoothing in the log domain in (29), which is estimated as in [25].
The estimate of the reverberant speech PSD can be used to estimate the reverberant PSD \({\sigma _{r}^{2}}(k,\ell)\) (cf. Section 4.4). After having estimated \({\sigma _{r}^{2}}(k,\ell)\), cepstral smoothing is also used to estimate the dereverberated clean speech PSD \({\sigma _{s}^{2}}(k,\ell)\). In this case, the noise PSD \(\sigma _{\tilde {v}}^{2}(k,\ell)\) in (26) and (27) is replaced by the interference PSD \({\sigma _{j}^{2}}(k,\ell) = \sigma _{\tilde {v}}^{2}(k,\ell) + {\sigma _{r}^{2}}(k,\ell)\).
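One frame of the cepstral smoothing chain described in this section can be sketched as follows. The code works on a one-sided spectrum and uses a fixed, symmetric smoothing vector; the pitch-dependent adaptation of δ(q,ℓ) and the estimation of the bias term κ from [25] are left out. The lower bound `xi_min`, the smoothing values, and the function name are hypothetical.

```python
import numpy as np

def cepstrally_smoothed_psd(periodogram, noise_psd, ceps_prev, delta,
                            xi_min=1e-2, kappa=0.0):
    """One frame of temporal cepstrum smoothing on L/2 + 1 frequency bins."""
    xi_ml = periodogram / noise_psd - 1.0               # ML estimate of the a priori SNR
    psd_ml = noise_psd * np.maximum(xi_ml, xi_min)      # lower-bounded ML PSD estimate
    ceps_ml = np.fft.irfft(np.log(psd_ml))              # real cepstrum of length L
    ceps = delta * ceps_prev + (1.0 - delta) * ceps_ml  # recursive smoothing per quefrency
    psd = np.exp(kappa + np.fft.rfft(ceps).real)        # back to the spectral domain
    return psd, ceps

L = 512
rng = np.random.default_rng(1)
noise_psd = np.ones(L // 2 + 1)
periodogram = noise_psd * (1.0 + rng.exponential(5.0, L // 2 + 1))

# Mild smoothing for low quefrencies, strong smoothing elsewhere; the vector is
# kept symmetric (q and L - q share a value) so the smoothed cepstrum stays even.
delta = np.full(L, 0.9)
delta[:5] = 0.2
delta[-4:] = 0.2

# First frame without memory (delta = 0), second frame with the same input.
psd0, ceps0 = cepstrally_smoothed_psd(periodogram, noise_psd, np.zeros(L), np.zeros(L))
psd1, ceps1 = cepstrally_smoothed_psd(periodogram, noise_psd, ceps0, delta)
psd_ml = noise_psd * np.maximum(periodogram / noise_psd - 1.0, 1e-2)
```

Without smoothing the chain returns the ML estimate itself, and for a stationary input the smoothed estimate remains at that value. For the re-estimation of the clean speech PSD, `noise_psd` would be replaced by the sum of the noise and reverberant PSD estimates, as stated in the text.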
4.4 Reverberant PSD estimation
The RIR model presented in [23] represents the RIR as a Gaussian noise signal multiplied by an exponential decay Δ, which depends on the room reverberation time, T 60, i.e.,
In the proposed spectral enhancement scheme, the approach derived from this model and presented in [15] is used to estimate the reverberant PSD \({\sigma _{r}^{2}}(k,\ell)\) as
with
In (34), T s denotes the frame shift whereas T d is the duration of the direct path and early reflections of the RIR, typically assumed to be between 50 and 80 ms. As a result, the estimate \(\hat {\sigma }_{r}^{2}\left (k,\ell \right)\) can be obtained using \(\hat {\sigma }_{z}^{2}\left (k,\ell \right)\) and an estimate of the reverberation time T 60 obtained using an online estimator such as the one proposed in [30].
Finally, using the estimated PSDs of the reverberation and of the residual noise, an estimate \(\hat {\sigma }^{2}_{s}(k,\ell)\) of the clean speech PSD is obtained. These estimates are used in (21) to compute the a priori SIR and in (22) to compute the real-valued spectral gain, \(\tilde {G}(k,\ell)\).
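The late reverberant PSD estimator can be sketched under the commonly used form of this model-based approach, in which the reverberant speech PSD is delayed by T d and attenuated by \(e^{-2\Delta T_{d}}\), with the decay rate \(\Delta = 3\ln(10)/T_{60}\). This form is an assumption of the sketch and may differ in detail from the expressions in (34); the function name and default values (T d = 80 ms, frame shift 16 ms) are chosen for illustration.

```python
import numpy as np

def late_reverb_psd(psd_z, t60, t_d=0.08, t_s=0.016):
    """Late reverberant PSD as a delayed and attenuated copy of psd_z.

    psd_z: array of shape (K, L) with the reverberant speech PSD estimate.
    t60:   reverberation time in seconds.
    t_d:   duration of direct path and early reflections in seconds.
    t_s:   frame shift in seconds.
    """
    decay = 3.0 * np.log(10.0) / t60               # decay rate in 1/s (assumed form)
    n_delay = int(round(t_d / t_s))                # T_d expressed in frames
    psd_r = np.zeros_like(psd_z, dtype=float)      # no estimate before the first delay
    n_frames = psd_z.shape[1]
    psd_r[:, n_delay:] = np.exp(-2.0 * decay * t_d) * psd_z[:, :n_frames - n_delay]
    return psd_r

# Sanity check: if T_d equals T60, the attenuation is 60 dB by definition of T60.
psd_z = np.ones((3, 20))
psd_r = late_reverb_psd(psd_z, t60=0.08, t_d=0.08)
```

The check uses an unrealistically short T 60 only to verify the scaling: after a delay of one reverberation time the power has dropped by a factor of 10⁻⁶.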
5 Experimental setup
5.1 Corpus description
The results presented in this paper have been obtained using the evaluation set of the REVERB challenge [26], which consists of a large corpus of speech corrupted by reverberation and noise. All recordings have been made at a sampling frequency of 16 kHz with a circular microphone array with 20 cm diameter and 8 equidistant microphones. This corpus is divided into simulated and real data. The simulated data is composed of clean speech signals taken from the WSJCAM0 corpus [34], which have been convolved with RIRs recorded in three different rooms and to which measured noise at a fixed SNR of 20 dB has been added. The real data is composed of utterances from the MC-WSJ-AV corpus [35] and contains speech recorded in a room in the presence of noise. The utterances have been spoken from different unknown positions within each room, but the position was constant during each utterance. For each room, two distances (denoted by “near” and “far”) between the target speaker and the center of the microphone array have been considered. The combination of a room and a particular distance will be referred to as “condition” in the remainder of this paper. The characteristics of each condition along with the labels used to refer to it are summarized in Table 1.
5.2 Algorithm settings
For the experiments, it has been assumed that the T 60 and the DOA of the target speaker remain constant for each utterance. Therefore, both T 60 and DOA have been estimated only once per utterance. The STFT has been computed using a 32-ms Hann window with 50 % overlap and an FFT of length L= 512. The DOA has been estimated as the angle maximizing the sum of the MUSIC pseudo-spectra, for θ= 0 ° … 360 ° for every 2 °, using all 8 microphones of the circular microphone array for the frequency range from 50 Hz to 5 kHz, cf. Section 3.3.
The MVDR beamformer uses a theoretically diffuse noise coherence matrix and a white noise gain constraint WNGmax= −10 dB if fewer than 10 frames are detected as noise when applying the VAD, cf. (11). The VAD has been configured similarly as in [18], but its parameters have been adapted in order to apply it to signals with a sampling frequency of 16 kHz. Otherwise, the noise coherence matrix is estimated using all detected noise-only frames, cf. (9). The speech amplitude estimator in Section 4.1 assumes a chi pdf with shape parameter μ=0.5, a minimum gain G min of −10 dB, and a compression parameter β=0.5. The noise PSD estimator described in Section 4.2 uses the same parameters as in [13], except for the length of the sliding window for minima tracking which has been set to either 1.5 s (SE 1.5) or 3 s (SE 3) in our experiments. In (31), η=0.96 and all parameters used for the speech PSD estimation, described in Section 4.3, have been set as prescribed in [24]. In (34), T d has been set to 80 ms.
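The settings above translate into sample, bin, and frame counts as in the following arithmetic sketch. The rounding choices (ceiling for the tracking windows and for the lower frequency bin, floor for the upper bin), the DOA grid from 0° to 358°, and the conversion of G min as an amplitude gain are assumptions of the sketch.

```python
import numpy as np

fs = 16000                                   # sampling frequency in Hz
win_len = 32 * fs // 1000                    # 32 ms window in samples
hop = win_len // 2                           # 50 % overlap, i.e. 16 ms frame shift
n_fft = 512                                  # FFT length L

bin_hz = fs / n_fft                          # frequency resolution in Hz
k_low = int(np.ceil(50 / bin_hz))            # first bin at or above 50 Hz
k_high = int(np.floor(5000 / bin_hz))        # last bin at or below 5 kHz
n_bins = k_high - k_low + 1                  # number of bins used for MUSIC

angles = np.arange(0, 360, 2)                # candidate DOAs in degrees

ms_win_15 = int(np.ceil(1.5 * fs / hop))     # 1.5 s tracking window in frames
ms_win_30 = int(np.ceil(3.0 * fs / hop))     # 3 s tracking window in frames
td_frames = round(0.080 * fs / hop)          # T_d = 80 ms in frames

g_min = 10.0 ** (-10.0 / 20.0)               # -10 dB spectral floor as amplitude gain
```

With these settings the window spans 512 samples, the frame shift is 256 samples, the MUSIC search uses bins 2 to 160 over 180 candidate angles, and the two tracking windows cover 94 and 188 frames.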
6 Results
The performance of the proposed system for each condition is evaluated in terms of instrumental speech quality measures (cf. Section 6.2) as well as in terms of word error rate (WER) when using the proposed system as a preprocessing scheme for the REVERB challenge baseline ASR system (cf. Section 6.3). Additionally, the results obtained in a subjective speech quality evaluation are presented for 4 out of 8 conditions in Section 6.4.
The performance of the combined scheme is compared to the performance when applying only the single-channel spectral enhancement scheme to the first microphone signal and when applying only the MVDR beamformer to the multichannel input.
6.1 Observations on beamformer design
The MVDR beamformer used in this paper is steered towards the estimated DOA of the target speech signal. In practice, errors in the DOA estimation can result in speech degradation. Figure 4 (top) depicts the DOA error obtained in all conditions of the simulated data of the REVERB challenge (i.e., a total of 2176 utterances). The true DOA has been considered to be the one stated in the REVERB challenge data documentation [36]. Ignoring outliers, it can be seen that the absolute value of the error is smaller than 5 ° in room S1 while in room S2, it is smaller than 10 ° for 50 % of the data and always smaller than 15 °. As expected, the largest error in DOA estimation appears in the case of room S3, which has the largest reverberation time. It can be seen that for room S3, in 50 % of the utterances, the absolute value of the DOA error is smaller than 15 °. However, it can be as high as 28 ° for some utterances.
In order to assess the detrimental effect that such DOA error could have on the performance of the MVDR beamformer, one may examine its corresponding beampattern. Figure 4 (bottom) depicts the beampattern of the MVDR beamformer computed using the noise coherence matrix of a theoretically diffuse noise field as in (11), steered towards the zero degrees direction, and using the microphone configuration described in Section 5.1. By observing the width of the main lobe, it appears that the error in DOA is small enough to not introduce distortions in rooms S1 and S2. Some cancellation of the target speech signal may occur in room S3 but should be limited to frequencies higher than 4 kHz.
6.2 Instrumental speech quality measures
The performance in terms of instrumental speech quality measures for the different considered conditions is presented in Table 2 for the simulated data and in Table 3 for the real data. Since various instrumental speech quality measures exist which can be used to assess the quality of denoised and dereverberated signals [37–39] and since it is difficult to assess the quality using only one single measure, the performance of the proposed system has been evaluated using the five signal-based quality measures suggested in [26], i.e., the speech to reverberation modulation energy ratio (SRMR) [40], the cepstral distance (CD) [41], the log likelihood ratio (LLR) [41], the frequency-weighted segmental SNR (FWSSNR) [41], and the perceptual evaluation of speech quality (PESQ) [42]. Among these five quality measures, the SRMR is the only non-intrusive measure, i.e., not requiring a reference signal, and is hence the only measure that can be used to evaluate the performance for real data. The other measures use the clean speech signal s(n) as the reference signal.
For the single-channel case, Tables 2 and 3 compare the quality of the unprocessed (first microphone) signal (“Unp.” in tables) to the quality of the signal processed using the proposed spectral enhancement scheme using the standard MS window of 1.5 s (SE 1.5) as well as a longer window of 3 s (SE 3) for all acoustic conditions (rooms S1, S2, and S3 for positions “near” and “far”). For the 8-channel case, Tables 2 and 3 compare the quality of the output of the MVDR beamformer with and without the spectral enhancement scheme, SE 1.5 and SE 3.
For each condition and for each instrumental quality measure, the best performance is highlighted by means of italic typeface to allow for an easier comparison. As expected, the selected instrumental measures do not always show completely consistent results [37, 38]. Nevertheless, some common tendencies can clearly be observed, which will be summarized next.
The results for all processed signals show an increase in SRMR, except for the MVDR beamformer in the case of room S2 (conditions “S2, near” and “S2, far”) of the simulated data. These conditions are also the only ones in which the SRMR is higher in the single-channel case than in the multi-channel case. This performance difference may result from an invalid noise coherence matrix or from errors in the DOA estimate for some utterances. The fact that the spectral enhancement scheme, used either alone or in combination with the MVDR beamformer, always increases the SRMR illustrates the ability of the proposed system to reduce the amount of reverberation both in the single- and the multi-channel case.
Additionally, the presented FWSSNR values depict a significant increase in comparison to the unprocessed microphone signal for all processed signals, except for the MVDR beamformer in the case of room S2. This illustrates the noise reduction capabilities of the proposed system. The difference in the FWSSNR values between the single- and the multi-channel scenarios further illustrates the benefit of using an MVDR beamformer aiming at noise reduction in the first stage. It can be noted that using a sliding window of 3 s instead of 1.5 s improves the FWSSNR scores in all simulated conditions, both in the single- and the multi-channel case. The advantage of using this longer sliding window is also illustrated by the lower CD values, both in the single- and in the multi-channel case, suggesting that distortions have been limited by avoiding leakage of the reverberation into the noise PSD estimate. Except for room S1, with the lowest amount of reverberation, both CD and LLR values are lower for the processed signals than for the unprocessed signal.
Finally, the improvement in the overall perceptual quality of the processed signal is illustrated by means of the PESQ score, which increases by up to 0.19 and 0.49 for the single- and multi-channel scenarios, respectively. The PESQ score is increased in all conditions, with the largest improvement being obtained by the combined system MVDR + SE 3.
6.3 Word error rate
In order to evaluate the potential benefit of the proposed signal enhancement scheme on the performance of an ASR system, the processed signals have been used as the input for the baseline speech recognition system provided by the REVERB challenge [26]. This system is based on the hidden Markov model toolkit (HTK) [43], using mel-frequency cepstral coefficients, including Deltas and double Deltas, as features and acoustic models with tied-state hidden Markov models with 10 Gaussian components per state. The ASR models provided by the REVERB challenge [26] have been trained on clean data containing 7861 sentences uttered by 92 speakers for a total of approximately 17.5 h. The achieved ASR performance is measured in terms of WER, as depicted in Fig. 5, for the different signal enhancement schemes and acoustic conditions.
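As a reminder of what the WER measures, the following minimal sketch computes it as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is a generic illustration and not the scoring tool of the REVERB challenge baseline; the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between two whitespace-tokenized sentences."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, the WER can exceed 100 % when the hypothesis is much longer than the reference.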
Compared to the scores obtained using the unprocessed signals (cf. horizontal black lines in Fig. 5), the WER increases slightly for the conditions with the lowest reverberation time (room S1). This indicates that spectral coloration introduced by the enhancement scheme may reduce the performance of the ASR system while the benefit of dereverberation is limited for small reverberation times. In all other conditions, the single-channel spectral enhancement scheme reduces the WER, with SE 3 yielding larger improvements than SE 1.5. Except for room S3, the MVDR beamformer yields better results than the single-channel scheme. The combination of the MVDR beamformer with SE 3 yields the largest improvement, with an absolute WER reduction of up to 44.28 % for the simulated data (condition “S2, far”) and of up to 29.48 % for the real data (condition “R1, near”).
6.4 Subjective evaluation of the speech quality
Since instrumental quality assessment, especially for the task of assessing dereverberation performance, may not always correlate well with the opinion of human listeners [37], we conducted a listening experiment in addition to the instrumental quality assessment described before.
The subjective evaluation is based on a multi-stimulus test with hidden reference and anchor (MUSHRA) following the specifications described in [28]. Four acoustic conditions have been tested: “S2, near”, “S2, far”, “R1, near”, and “R1, far”. These conditions have been chosen to match the conditions used in the online MUSHRA test conducted in [27]. We have carried out a subjective evaluation for the unprocessed signal and for 3 processing schemes, namely, the single-channel scheme applied to the first microphone signal (SE 3), the MVDR beamformer using 8 microphones (MVDR), and the combination of the MVDR beamformer with the spectral enhancement scheme (MVDR + SE 3). In addition to these signals, a hidden reference and an anchor have been presented to the subjects. The hidden reference was the anechoic speech signal in the case of simulated data and the signal recorded by a headset microphone in the case of real data. The anchor consisted of the first microphone signal, low-pass filtered with a cut-off frequency of 3.5 kHz.
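The construction of the anchor can be illustrated with a minimal sketch of a low-pass filter with a 3.5 kHz cut-off at a sampling frequency of 16 kHz. Only the cut-off frequency is taken from the text; the filter type (Hamming-windowed sinc FIR), the number of taps, and the function names `lowpass_fir` and `fir_filter` are assumptions made for illustration and may differ from the filter actually used in the listening test.

```python
import math


def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Hamming-windowed sinc low-pass FIR coefficients with unit DC gain."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    fc = cutoff_hz / fs_hz  # normalized cut-off in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        m = n - mid
        # Ideal low-pass impulse response, truncated to num_taps samples.
        ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        # Hamming window to reduce the ripple caused by the truncation.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def fir_filter(signal, taps):
    """Causal FIR filtering with zero initial state; output length equals input length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * signal[n - k]
        out.append(acc)
    return out


# Assumed parameters: 3.5 kHz cut-off (from the text), 16 kHz sampling rate.
anchor_taps = lowpass_fir(3500.0, 16000.0)
```

With these assumed settings, components well below 3.5 kHz pass essentially unchanged while components well above it are strongly attenuated, which is what makes the anchor sound band-limited.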
A total of 21 self-reported normal-hearing listeners participated in the MUSHRA listening test. The listening test was conducted in a soundproof booth and the subjects listened to diotic signals through headphones (Sennheiser HD 380 pro). Each subject evaluated 3 utterances per condition (i.e., 12 utterances per subject), in terms of two different attributes: “overall quality” and “perceived amount of reverberation”, on a scale ranging from 0 to 100. For each subject, the utterances to be evaluated were randomly picked from the REVERB challenge database. All signals were normalized in amplitude and presented at a sampling frequency of 16 kHz and a quantization of 16 bit using a Roland sound card (model UA-25EXCW). The listening test was divided into three stages. In the first stage, the subjects were asked to listen to all files that would be presented to them during a training phase. This training phase allowed the subjects to get familiar with the data to be evaluated and to adjust the sound volume to a comfortable level. In the second stage, the subjects had to evaluate the overall quality of the signals and finally, the third stage consisted in the evaluation of the perceived amount of reverberation. The order of presentation of algorithms and conditions was randomized between all stages and all subjects.
The obtained MUSHRA scores are summarized in Fig. 6. The anchor appears to be the least satisfactory for the attribute “overall quality,” suggesting that the subjects used the full extent of the grading scale. However, this is not the case for the attribute “perceived amount of reverberation”, illustrating the difficulty of evaluating this attribute. The three considered processing schemes yielded an improvement compared to the unprocessed signal both in terms of “overall quality” and of “perceived amount of reverberation”. As expected, the largest reduction of the “perceived amount of reverberation” is observed for the combination MVDR + SE 3. The combination MVDR + SE 3 improves the overall quality as well, although the improvement, compared to the single-channel scheme, is lower than for the attribute “perceived amount of reverberation”. The use of an MVDR beamformer alone reduces the “perceived amount of reverberation” but does not improve the performance compared to the single-channel processing scheme (SE 3).
Since the scores of the MUSHRA test were not normally distributed, Friedman’s test [44] was used to examine the significance of the results, excluding the scores of the anchor and the reference. The results of Friedman’s test are presented in Table 4. The p values (p<0.01) show that at least one significant pairwise difference can be observed in all conditions and for all attributes. In order to examine the significance of the pairwise difference in performance between the processing schemes, a Wilcoxon rank sum test [45] has been used for each condition separately. A Bonferroni correction has been applied, resulting in significant effects being considered for p<0.05/6. For the attribute “perceived amount of reverberation”, the differences in performance between the unprocessed signal and all processing schemes are significant but no significant differences were present between the different processing schemes. The same conclusion holds for the attribute “overall quality”, except for the room R1 and the condition “S2, near”, where the differences between the unprocessed signal and the output of the MVDR beamformer do not appear to be significant.
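The two ingredients of this analysis can be sketched as follows. The first part computes the Friedman statistic from per-rating rank sums (without the correction for ties), and the second part enumerates the pairwise comparisons among the four evaluated signals. The divisor 6 of the Bonferroni-corrected threshold is consistent with the number of such pairs; this is a reading of the text rather than something it states explicitly. All function and variable names are hypothetical, and no listening test data are reproduced here.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(scores):
    """Friedman statistic for scores[i][j] (block i, treatment j), no tie correction."""
    n = len(scores)
    k = len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


# Signals entering the pairwise comparison (anchor and reference excluded).
SIGNALS = ["Unprocessed", "SE 3", "MVDR", "MVDR + SE 3"]
PAIRS = list(combinations(SIGNALS, 2))  # 6 pairwise comparisons
ALPHA = 0.05
ALPHA_CORRECTED = ALPHA / len(PAIRS)    # Bonferroni: 0.05 / 6
```

Under the null hypothesis, the Friedman statistic is approximately chi-square distributed with one degree of freedom fewer than the number of compared signals; a pairwise difference is then declared significant only if its p value falls below `ALPHA_CORRECTED`.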
Even though the statistical significance criterion is not always satisfied, the trend of the results confirms the benefits of combining a beamformer with a single-channel spectral enhancement scheme for reducing reverberation and noise and for improving the overall speech quality.
7 Conclusions
In this paper, we have presented the combination of an MVDR beamformer with a single-channel spectral enhancement scheme, aiming at joint dereverberation and noise reduction. In the MVDR beamformer, the noise coherence matrix is estimated online using a VAD, whereas the DOA of the target speaker is estimated using the MUSIC algorithm. The output of this beamformer is processed using a spectral enhancement scheme combining statistical estimators of the speech, noise, and reverberant PSDs and aiming at joint residual reverberation and noise suppression. The evaluation of the proposed system, carried out using instrumental speech quality measures, a speech recognizer trained on clean data and subjective listening tests, illustrates the benefits of the proposed scheme.
References
J Benesty, J Chen, Y Huang, Microphone Array Signal Processing (Springer, Berlin, Germany, 2008).
S Gannot, I Cohen, in Springer Handbook of Speech Processing, Chap. 47, ed. by J Benesty, MM Sondhi, Y Huang. Adaptive beamforming and postfiltering (Springer, Berlin, 2008).
PA Naylor, ND Gaubitch, Speech Dereverberation (Springer, Berlin, 2010).
B Cauchi, I Kodrasi, R Rehr, S Gerlach, A Jukić, T Gerkmann, S Doclo, S Goetze, in Proc. REVERB Challenge Workshop. Joint dereverberation and noise reduction using beamforming and a single-channel speech-enhancement scheme (Florence, Italy, 2014).
JS Bradley, H Sato, M Picard, On the importance of early reflections for speech in rooms. J. Acoust. Soc. Am. 113(6), 3233–3244 (2003).
R Maas, EAP Habets, A Sehr, W Kellermann, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 1. On the application of reverberation suppression to robust speech recognition (Kyoto, Japan, 2012), pp. 297–300.
EAP Habets, S Gannot, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IV. Dual-microphone speech dereverberation using a reference signal (Honolulu, USA, 2007), pp. 901–904.
S Braun, EAP Habets, in Proc. European Signal Processing Conference (EUSIPCO). Dereverberation in noisy environments using reference signals and a maximum likelihood estimator (Marrakech, Morocco, 2013).
A Schwarz, K Reindl, W Kellermann, in Proc. Intl. Workshop Acoust. Signal Enhancement (IWAENC). On blocking matrix-based dereverberation for automatic speech recognition (Aachen, Germany, 2012), pp. 1–4.
A Schwarz, K Reindl, W Kellermann, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). A two-channel reverberation suppression scheme based on blind signal separation and Wiener filtering (Kyoto, Japan, 2012), pp. 113–116.
A Kuklasiński, S Doclo, SH Jensen, J Jensen, in Proc. European Signal Processing Conference (EUSIPCO). Maximum likelihood based multi-channel isotropic reverberation reduction for hearing aids (Lisbon, Portugal, 2014), pp. 61–65.
S Wisdom, T Powers, L Atlas, J Pitton, in Proc. REVERB Challenge Workshop. Enhancement of reverberant and noisy speech by extending its coherence (Florence, Italy, 2014).
R Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 9(5), 504–512 (2001).
T Gerkmann, RC Hendriks, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio Speech Lang. Process. 20(4), 1383–1393 (2012).
K Lebart, JM Boucher, PN Denbigh, A new method based on spectral subtraction for speech de-reverberation. Acta Acustica. 87, 359–366 (2001).
EAP Habets, S Gannot, I Cohen, Late reverberant spectral variance estimation based on a statistical model. IEEE Signal Process. Lett. 16(9), 770–774 (2009).
J Bitzer, KU Simmer, in Microphone Arrays, Digital Signal Processing, ed. by M Brandstein, D Ward. Superdirective microphone arrays (Springer, Berlin, 2001), pp. 19–38.
J Ramírez, JC Segura, C Benítez, A De La Torre, A Rubio, Efficient voice activity detection algorithms using long-term speech information. Speech Commun. 42(3), 271–287 (2004).
RO Schmidt, Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 34(3), 276–280 (1986).
N Madhu, Acoustic source localization: Algorithms, applications and extensions to source separation (Ph.D. thesis, Ruhr-Universität Bochum, May 2009).
HW Löllmann, P Vary, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). A blind speech enhancement algorithm for the suppression of late reverberation and noise (Taipei, Taiwan, 2009), pp. 3989–3992.
C Breithaupt, M Krawczyk, R Martin, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech (Las Vegas, Nevada, USA, 2008), pp. 4037–4040.
JD Polack, Playing billiards in the concert hall: the mathematical foundations of geometrical room acoustics. Appl. Acoustics. 38(2), 235–244 (1993).
C Breithaupt, T Gerkmann, R Martin, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing (Las Vegas, Nevada, USA, 2008), pp. 4897–4900.
T Gerkmann, R Martin, On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling. IEEE Trans. Signal Process. 57(11), 4165–4174 (2009).
K Kinoshita, M Delcroix, T Yoshioka, T Nakatani, EAP Habets, R Haeb-Umbach, V Leutnant, A Sehr, W Kellermann, R Maas, S Gannot, B Raj, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech (New Paltz, NY, USA, 2013).
K Kinoshita, M Delcroix, T Yoshioka, T Nakatani, EAP Habets, R Haeb-Umbach, V Leutnant, A Sehr, W Kellermann, R Maas, S Gannot, B Raj, Summary of the REVERB challenge (2014). [Online] Available: http://reverb2014.dereverberation.com/workshop/slides/reverb_summary.pdf. Accessed 07/07/15.
ITU (ITU-R), Recommendation BS.1534–3: Method for the Subjective Assessment of Intermediate Quality Levels of Coding Systems. Online, available at: http://www.itu.int/rec/R-REC-BS.1534-2-201406-I/en. access date 07/07/15.
H Cox, RM Zeskind, MM Owen, Robust adaptive beamforming. IEEE Trans. Acoust. Speech Signal Process. 35(10), 1365–1376 (1987).
J Eaton, ND Gaubitch, PA Naylor, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Noise-robust reverberation time estimation using spectral decay distributions with reduced computational cost (Vancouver, Canada, 2013), pp. 161–165.
IS Gradshteyn, IM Ryzhik, Table of Integrals, Series, and Products (Academic Press, Inc., Boston, 1994).
Y Ephraim, D Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32(6), 1109–1121 (1984).
Y Ephraim, D Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 33(2), 443–445 (1985).
T Robinson, J Fransen, D Pye, J Foote, S Renals, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition (Detroit, Michigan, USA, 1995), pp. 81–84.
M Lincoln, I McCowan, J Vepa, HK Maganti, in Proc. IEEE Workshop Autom. Speech Recognition and Understanding (ASRU). The multichannel Wall Street Journal audio–visual corpus (MC-WSJ-AV): Specification and initial experiments (Cancún, Mexico, 2005), pp. 357–362.
REVERB Challenge, Documentation about the room impulse responses and noise data used for the REVERB challenge SimData. [Online] Available: http://reverb2014.dereverberation.com/tools/Document_RIR_noise_recording.pdf, Accessed: June 27, 2015.
S Goetze, On the Combination of Systems for Listening-Room Compensation and Acoustic Echo Cancellation in Hands-Free Telecommunication Systems (PhD thesis, Dept. of Telecommunications, University of Bremen (FB-1), Bremen, Germany, 2013).
S Goetze, A Warzybok, I Kodrasi, JO Jungmann, B Cauchi, J Rennies, E Habets, A Mertins, T Gerkmann, S Doclo, B Kollmeier, in Proc. Intl. Workshop Acoust. Signal Enhancement (IWAENC). A study on speech quality and speech intelligibility measures for quality assessment of single-channel dereverberation algorithms (Antibes, France, 2014).
PC Loizou, Speech Enhancement Theory and Practice (Taylor & Francis, New York, 2007).
T Falk, C Zheng, W-Y Chan, A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech. IEEE Trans. Audio Speech Lang. Process. 18(7), 1766–1774 (2010).
Y Hu, PC Loizou, Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 16(1), 229–238 (2008).
ITU-T, Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-end Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs. Online, available at: https://www.itu.int/rec/T-REC-P.862-200102-I/en, access date 07/07/15.
S Young, G Evermann, M Gales, T Hain, D Kershaw, X Liu, G Moore, J Odell, D Ollason, D Povey, V Valtchev, P Woodland, The HTK Book, 3.4.1 edn (Cambridge University Engineering Dept, Cambridge, 2009). http://htk.eng.cam.ac.uk/prot-docs/HTKBook/htkbook.html.
M Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 32(200), 675–701 (1937).
JD Gibbons, S Chakraborti, Nonparametric Statistical Inference (Springer, Berlin, 2011).
Acknowledgements
The research leading to these results has received funding from the EU Seventh Framework Programme project DREAMS under grant agreement ITN-GA-2012-316969 as well as by the DFG-Cluster of Excellence EXC 1077/1, Hearing4all.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Cauchi, B., Kodrasi, I., Rehr, R. et al. Combination of MVDR beamforming and single-channel spectral processing for enhancing noisy and reverberant speech. EURASIP J. Adv. Signal Process. 2015, 61 (2015). https://doi.org/10.1186/s13634-015-0242-x