Single-channel speech enhancement (SE) is used to dereverberate, as well as to de-noise the speech signal. Here, the MMSE estimator described in [21] is used to obtain the clean speech magnitude, which requires the PSDs of the speech signal and the interference. It has been shown, e.g., in [3, 18], that attenuating late reverberation is crucial for ASR while early reflections can be mitigated well by, e.g., cepstral mean subtraction (CMS) [35]. Hence, the interference signal is considered to contain late reverberation and the noise, while the desired speech signal is formed by the direct path and some early reflections.
2.1 Structure of the SE algorithm
The recorded microphone signal y[k] consists of the reverberant speech signal x[k] and the noise n[k],
$$ {y}[k] = {x}[k] + {n}[k] \,, $$
((1))
with k denoting the discrete time index. The reverberant speech signal x[k] can be modeled as the convolution of the clean (anechoic) speech signal s[k] and the RIR h[k]. The reverberant signal x[k] can be split into the clean speech signal s[k] and the residual reverberation x
r[k] as
$$\begin{array}{@{}rcl@{}} {x}[k] = {s}[k] \ast {h}[k] = \sum\limits_{k^{\prime}=0}^{\infty}\, {s}[k-k{\prime}] \cdot {h}[k{\prime}] \\= s[k]+ \underbrace{\sum\limits_{k{\prime}=1}^{\infty}\, {s}[k-k{\prime}] \cdot {h}[k{\prime}]}_{={x}_{\mathrm{r}}[k]} \,, \end{array} $$
((2))
where ∗ denotes the convolution. Furthermore, x
r[k] can be decomposed into early reflections x
e[k] and late reverberation x
l[k] separated at sample K
e (usually up to about 50 ms of h[k] [25], i.e., K
e=⌊f
s
·50ms⌋ at a sampling frequency f
s
) as
$$ {x}_{\mathrm{r}}[k] = \underbrace{\sum\limits_{k{\prime}=1}^{K_{\mathrm{e}}}\, {s}\left[k-k{\prime}\right] \cdot {h}\left[k{\prime}\right]}_{={x}_{\mathrm{e}}[k]} + \underbrace{\sum\limits_{k{\prime}=K_{\mathrm{e}}+1}^{\infty}\, {s}\left[k-k{\prime}\right] \cdot {h}\left[k{\prime}\right]}_{={x}_{\mathrm{l}}[k]} \,. $$
((3))
With (2) and (3), (1) can be rewritten as,
$$ {y}[k] = {s}[k] + {x}_{\mathrm{r}}[k] + n[k] = \underbrace{{s}[k] + {x}_{\mathrm{e}}[k]}_{={y}_{\mathrm{d}}[k]} + \underbrace{{x}_{\mathrm{l}}[k] + n[k]}_{={y}_{\mathrm{i}}[k]} \,, $$
((4))
with y
d[k] being the desired signal part which contains the direct-path signal s[k] and early reflections x
e[k]. The interference component y
i[k] consists of late reverberation x
l[k] and the noise n[k]. Note that the noise n[k] considered here is mostly stationary and was recorded under the same reverberation conditions as the RIR measurement [8]. The goal is to find an estimate of the desired speech component \(\hat {y}_{\mathrm {d}}[k]\) and, by this, to reduce the interference y
i[k].
The proposed speech enhancement algorithm depicted in Fig. 1 operates in the short-time Fourier transform (STFT) domain. Uppercase variables, e.g., Y[m,ℓ], X
l[m,ℓ], and Y
d[m,ℓ], denote the STFT representations of y[k], x
l[k], and y
d[k], respectively, with m and ℓ representing the frequency bin and the temporal frame index. First, we need to estimate the background noise (cf. Section 2.2, please note that no reverberation is included in the noise PSD estimate) to obtain an estimate of the reverberant speech from TCS (cf. Section 2.3). This PSD \(\hat {{\lambda }}_{{X}}[m,\ell ]\) is then used to obtain an estimate of late reverberation based on the RIR model from [26] (cf. Section 2.4), which allows for estimating the clean speech PSD \(\hat {{\lambda }}_{{Y}_{\mathrm {d}}}[m,\ell ]\) by applying TCS again. Then, the MMSE estimator (cf. Section 2.5) is applied to obtain an estimate of the clean speech signal \(\hat {{Y}}_{\mathrm {d}}[m,\ell ]\), which is then transferred back to the time domain by inverse STFT (ISTFT) and subsequently forwarded to the ASR system. Additionally, a joint (T
60,DRR) estimator using an MLP network (cf. Section 2.6) is proposed as input for the late reverberation PSD estimate \(\hat {{\lambda }}_{{X}_{\mathrm {l}}}[m,\ell ]\).
2.2 Noise PSD estimation
The noise PSD λ
N
[m,ℓ] is estimated using an adapted version of the well-known MS method [22]. MS offers accurate noise PSD estimation especially if the noise signal is stationary to a certain extent, i.e., varies slowly compared to the statistics of the desired speech component, which is true for the noise contained in the REVERB challenge data set. It is assumed that the minima in this PSD originate from time-frequency bins that do not contain speech. These minima are tracked using a search window spanning usually 1.5 s of the estimated input PSDs in [22]. However, to ensure that no reverberation (which introduces the decay tail to the speech pauses) leaks into the noise PSD estimate, we enlarged this search window to 3 s [36].
2.3 Speech PSD estimation
We employ temporal cepstrum smoothing (TCS) [23, 37] for estimating the reverberant and the clean speech PSDs. This approach smoothes the maximum likelihood (ML) estimate of the clean or reverberant speech PSD over time in the cepstral domain. Due to the compact representation of speech in the cepstral domain, speech-related and non-speech-related coefficients can be selectively smoothed. Compared to other approaches, e.g., the decision-directed approach [38], TCS is able to reduce musical noise artifacts, which is crucial for ASR systems [19].
First, the PSD of the reverberant speech λ
X
[m,ℓ]=E{|X[m,ℓ]|2} (E{·} is the expectation operator) is obtained by the ML estimate [38],
$$\begin{array}{@{}rcl@{}} \hat{{\lambda}}^{\text{ml}}_{{X}}[\!m,\ell]\,=\,\max\!\left(\!|{Y}[\!m,\ell]|^{2} \- \hat{{\lambda}}_{{N}}[\!m,\ell], \, {{\xi}}_{\min} \cdot {\hat{{{\lambda}}}_{{N}}}[\!m,\ell]\!\! \right)\!, \end{array} $$
((5))
where ξ
min is the lower bound of the a priori signal-to-noise-ratio. Then, the cepstral representation of the above ML estimate is calculated as
$$ {\hat{\lambda}_{{X}}^{c,\text{ml}}}[q,\ell] = {\mathcal{F}}^{-1} \left\{\ln \left({\hat{{{\lambda}}}^{\text{ml}}_{{X}}}[m,\ell] \right) \right\}\,, $$
((6))
where the superscript c denotes the cepstral domain and q represents the cepstral or quefrency index. \(\mathcal {F}^{-1}\{\cdot \}\) denotes the inverse discrete Fourier transform (IDFT). After that, smoothing is applied to (6), i.e.,
$$ \hat{{\lambda}}_{{X}}^{c}[q,\ell] = {\mathbf{\alpha}^{c}}[q,\ell] \cdot \hat{\lambda}_{{X}}^{c}[q, \ell-1] + ({{1}}-{{{\alpha}}^{c}}[q,\ell]) \cdot \hat{\lambda}_{{X}}^{c,\text{ml}}[q,\ell]\,, $$
((7))
where α
c[q,ℓ] represents a quefrency-dependent smoothing coefficient, which should be chosen such that the coefficients relevant for speech production are maintained while the remaining coefficients are strongly smoothed [23]. Thus, in an SE framework, usually α
c[q,ℓ] is chosen small for the speech spectral envelope represented by the low quefrencies and the fundamental period peak in the cepstrum [37]. In contrast to SE, preserving the fundamental frequency is not crucial for ASR systems [39], and α
c[q,ℓ], thus, is chosen as,
$$\begin{array}{@{}rcl@{}} {\alpha}^{c}[q,\ell] =\left\{\begin{array}{ll} 0.0 & \text{for}\, \, q = 0,\ldots, \left\lceil f_{s}\cdot 0.5\,\, \text{ms} \right\rceil -1 \,, \\ 0.5 & \text{for}\, \, q = \left\lceil f_{s}\cdot 0.5\,\, \text{ms} \right\rceil,\ldots, \left\lceil f_{s}\cdot 1\,\, \text{ms} \right\rceil -1\,, \\ 0.9 & \text{for}\, \, \text{otherwise.} \end{array} \right.\\ \end{array} $$
((8))
Note that the application range of q is only given for the lower half of the cepstrum, which will be applied accordingly to the symmetric counterpart. Finally, the reverberant speech PSD estimate \(\hat {\lambda }_{{X}}[m,\ell ]\) is achieved after transforming (7) back to the frequency domain,
$$ \hat{\lambda}_{{X}}[m,\ell] = {b} \cdot \exp \left(\mathcal{F} \left\{ \hat{\lambda}_{{X}}^{c}[q,\ell] \right\} \right) \,, $$
((9))
where \(\mathcal {F}\{\cdot \}\) represents the DFT operator. The factor b is a function of the smoothing factor α
c[q,ℓ] in (8) and compensates for the bias caused by the averaging in the cepstral domain [23]. For a detailed discussion of b the reader is referred to [23].
2.4 Late reverberation PSD estimation
In our workshop paper [12], Polack’s statistical RIR model [3, 24] has been used to achieve reverberation suppression based on an estimate of the late reverberation PSD \({\lambda }_{{X}_{\mathrm {l}}}[m,\ell ]= \mathrm {E}\,\left \{|{X}_{\mathrm {l}}[m,\ell ]|^{2}\right \}\). Please note, that for simplicity, the frequency bin m will be omitted in the following descriptions within Section 2, since the spectral bins are assumed to be independent. By using the reverberant speech PSD estimate \(\hat {\lambda }_{{X}}[\ell ]\) obtained from (9), as well as the separation of the early and late part in (2)-(4), the late reverberation PSD estimate can be calculated by [24]
$$ \hat{\lambda}_{{X}_{\mathrm{l}}}[\ell] = \exp(-2 {{\rho}} \tau_{\mathrm{s}} L_{\mathrm{e}}) \cdot \hat{\lambda}_{{X}}\left[\ell - L_{\mathrm{e}}\right]\,, $$
((10))
where the parameter L
e is a number of frames which corresponds to the duration of early part of the RIR (cf. K
e in samples in (3)). Consequently, L
e·τ
s is the start time of late reverberation (which is fixed to 50 ms here), and τ
s is the STFT time shift (hop size in s). ρ is the decay rate related to the reverberation time T
60, i.e., ρ=3 ln(10)/T
60. In [12], blind reverberation time T
60 estimation was achieved by the method proposed in [40], which is based on spectral decay distributions of the observed speech signal and is shown to be robust against additive noise when a noise PSD estimator is appended.
Considering reverberant situations where the speaker-microphone distances are smaller than the critical distance, i.e., those with positive DRRs [25], the statistic reverberation model proposed in [26] is used here which separates the direct path from Polack’s RIR model as used in [24], defined with the spectral variance λ
H
[ℓ] of the RIR h[k] in the STFT domain as
$$\begin{array}{@{}rcl@{}} \mathbf{\lambda}_{H}[\ell] = \left\{ \begin{array}{ll} \beta_{\mathrm{d}} & \text{for}\, \, \ell = 0 \,, \\ \beta_{\mathrm{r}} \exp ({-2 \rho \tau_{\mathrm{s}} \ell}) & \text{for}\, \, \ell \geq 1\,, \end{array} \right. \end{array} $$
((11))
where β
d and β
r denote the variances of the direct path and the residual reverberation part, respectively. Accordingly, the relationship to the DRR is given by [26]
$$ \text{DRR} = 10 \log_{10} \left(\frac{1-\exp({-2 \rho \tau_{\mathrm{s}}})}{\exp ({-2 \rho \tau_{\mathrm{s}}})} \cdot \frac{\beta_{\mathrm{d}}}{\beta_{\mathrm{r}}} \right) \,. $$
((12))
Using (11), the reverberant speech PSD can be computed by [26]
$$\begin{array}{@{}rcl@{}} \hat{\lambda}_{{X}_{\mathrm{r}}}[\ell] &=& (1-\kappa) \cdot \exp ({-2 \rho \tau_{\mathrm{s}}})\hat{\lambda}_{{X}_{\mathrm{r}}}[\ell-1]\\ &&+\; \kappa \cdot \exp ({-2 \rho \tau_{\mathrm{s}}}) \hat{\lambda}_{X}[\ell-1]\,, \end{array} $$
((13))
where κ=β
r/β
d is calculated from the DRR in (12), constraint in the range of (0, 1]. Then, the late reverberation PSD from (10) is modified to
$$ \hat{\lambda}_{{X}_{\mathrm{l}}}[\ell] = \exp ({-2 \rho \tau_{\mathrm{s}} (L_{e}-1)}) \cdot \hat{\lambda}_{{X}_{\mathrm{r}}}[\ell-L_{e}+1]. $$
((14))
If κ equals 1, then (14) is equivalent to (10), which shows that this approach is the same as the approach described in [24] under this condition. It has been shown in [41] that (14) provides a more reliable late reverberation PSD estimate so that less speech distortions are achieved, which is a benefit for the ASR system. A disadvantage of this method is that it requires not only T
60 but also the DRR. In other words, the reliable estimation of these two parameters plays a crucial role for the late reverberation PSD estimate, which here can be obtained by an MLP estimator described in Section 2.6.
2.5 MMSE estimator
After estimating the noise and late reverberation PSDs, the interference PSD can be obtained from (4) in a straightforward way as
$$ \hat{\lambda}_{{Y}_{\mathrm{i}}}[\ell] = \hat{\lambda}_{{X}_{\mathrm{l}}}[\ell] + \hat{\lambda}_{{N}}[\ell]\,, $$
((15))
assuming that the late reverberant signal x
l[k] and the noise n[k] are uncorrelated. \(\hat {\lambda }_{{Y}_{\mathrm {i}}}[\ell ]\) will be used to estimate the PSD of the desired speech component Y
d[ℓ] by another TCS procedure as depicted in the lower branch of Fig. 1. To achieve this, as aforementioned in Section 2.3, the input noise PSD \(\hat {\lambda }_{{N}}[\ell ]\) in (5)-(9) is replaced by \(\hat {\lambda }_{{Y}_{\mathrm {i}}}[\ell ]\). As we now use the PSD of the interference signal \(\hat {\lambda }_{{Y}_{\mathrm {i}}}[\ell ]\), i.e., the noise and late reverberation, TCS will estimate the PSD of the clean speech signal and early reflections \(\hat {\lambda }_{{Y}_{\mathrm {d}}}[\ell ]\).
In the final step, a parameterized MMSE spectral magnitude estimator [21] is used to determine the weighting function G[ℓ] to obtain the enhanced speech signal \(\hat {Y}_{\mathrm {d}}[\ell ]\). A simplified, computationally less complex version [42] based on the confluent hypergeometric function [43] is used, which is defined as
$$\begin{array}{@{}rcl@{}} {} {G}[\ell] &=& \left(\frac{1}{{1}+{\nu}[\ell]}\right)^{{p_{0}}} \cdot {G}_{0}[\ell] \end{array} $$
((16))
$$\begin{array}{@{}rcl@{}}&&+ \left(\frac{{\nu}[\ell]}{{1}+{\nu}[\ell]} \right)^{{p_{\infty}}} \cdot \frac{\hat{{\xi}}[\ell]}{\mu + \hat{{\xi}}[\ell]} \, \\ {}{G}_{0}[\ell] &=& \left(\frac{{\Gamma}({{\mu}}+{{\gamma}}/2)}{{\Gamma}({\mu})} \right)^{1/{{\gamma}}} \!\cdot \left(\frac{\hat{{\xi}}[\ell]}{ {{\mu}}+\hat{{\xi}}[\ell]} \cdot \frac{1}{\hat{{\zeta}}[\ell]} \right)^{1/2}, \end{array} $$
((17))
$$\begin{array}{@{}rcl@{}} {}{\nu}[\ell] &=& \frac{\hat{{\xi}}[\ell]}{{{\mu}}+\hat{{\xi}}[\ell]} \cdot \hat{{\zeta}}[\ell] \,, \end{array} $$
((18))
with Γ(·) being the complete gamma function. The estimates of the a priori and a posteriori desired-signal-to-interference-ratios are defined as \(\hat {{\xi }}[\ell ] = \hat {{\lambda }}_{{Y}_{\mathrm {d}}}[\ell ] / \hat {\lambda }_{{Y}_{\mathrm {i}}}[\ell ]\), and \(\hat {{\zeta }}[\ell ] = |{Y}[\ell ]|^{2} / \hat {\lambda }_{{Y}_{\mathrm {i}}}[\ell ]\), respectively. The constant parameters μ and γ can be tuned to yield several types of estimators. In [21], μ=0.5 and γ=0.5 have been identified as a good compromise between the amount of musical noise and the clarity of speech and are, therefore, also applied here. For obtaining the correct approximation for the selected values of μ and γ, the exponents p
0 and p
∞
in (16) have to be set to 0.5 and 1.0, respectively [42]. Subsequently, the estimated desired signal \(\hat {Y}_{\mathrm {d}}[\ell ]\) is calculated by
$$ \hat{Y}_{\mathrm{d}}[\ell] = \max({G}[\ell], {{G}_{\min}}) \cdot {Y}[\ell] \,, $$
((19))
with G
min being a lower bound for the weighting function G[ℓ] which alleviates speech distortions, however, also limits the possible amount of interference suppression. In conformance with [44], G
min=−10 dB is chosen as a good value to improve the ASR performance in reverberant environments. Then, as illustrated in Fig. 1, an ISTFT is conducted to reconstruct the output speech signal in the time domain \(\hat {y}_{\mathrm {d}}[k]\) used for the subsequent ASR experiments.
2.6 Estimation of room parameters (T
60
,DRR)
A novel approach to jointly estimate T
60 and DRR is proposed here based on our previous work [27, 28]. An overview of the estimation process is presented in Fig. 2: In a first step, reverberant signals are converted to spectro-temporal Gabor filterbank (GBFB) features [33, 45, 46] to capture information relevant for room parameter estimation. For details on GBFB selection, the reader is referred to [27]. A multi-layer perceptron (MLP) classifier, belonging to the class of feedforward artificial neural network models [47], is trained to map the input pattern to pairs of (T
60,DRR) values. Since the MLP generates one estimate per time step, we obtain an utterance-based estimate by simple temporal averaging and subsequent selection of the output neuron with the highest average activation (winner-takes-all approach). The MLP was implemented with the freely available QuickNet package [48] and has three layers. The output layer corresponds to the (T
60,DRR) pairs.
These pairs were defined based on the RIRs provided by the training data of the REVERB challenge. Figure 3 shows the distribution of (T
60,DRR) values for the given RIRs. The bounding boxes in the figure denote the categorical boundaries for the classes. We defined 28 classes as a compromise between a large number of classes (with the potential of more accurate (T
60,DRR) classification, but only few training examples for each class) and few classes (with coarse classification, but many training examples).
The T
60 values are obtained using Schroeder’s method [49], which formulates a poly-fit in the range between −35 to −5 dB of the RIR accumulation energy. The DRR in dB is calculated as
$$ \text{DRR} = 10\, \log_{10} \, \frac{\sum_{k=0}^{K_{\mathrm{d}}} |h[k]|^{2} }{ \sum_{k=K_{\mathrm{d}}+1}^{\infty} |h[k]|^{2}} \,, $$
((20))
where K
d represents the sample length of the direct sound arrival, which is usually measured as a short time period after the onset of the RIR. Here we take the maximum value of h[k] as the onset of the RIR, and the following 0.5 ms range as the direct path samples, i.e., K
d=⌊f
s
·0.5 ms⌋.