Feature enhancement of reverberant speech by distribution matching and non-negative matrix factorization

Keronen, Sami; Kallasjoki, Heikki; Palomäki, Kalle J.; Brown, Guy J.; Gemmeke, Jort F.

doi:10.1186/s13634-015-0259-1

Research
Open access
Published: 20 August 2015

EURASIP Journal on Advances in Signal Processing volume 2015, Article number: 76 (2015)

Abstract

This paper describes a novel two-stage dereverberation feature enhancement method for noise-robust automatic speech recognition. In the first stage, an estimate of the dereverberated speech is generated by matching the distribution of the observed reverberant speech to that of clean speech, in a decorrelated transformation domain that has a long temporal context in order to address the effects of reverberation. The second stage uses this dereverberated signal as an initial estimate within a non-negative matrix factorization framework, which jointly estimates a sparse representation of the clean speech signal and an estimate of the convolutional distortion. The proposed feature enhancement method, when used in conjunction with automatic speech recognizer back-end processing, is shown to improve the recognition performance compared to three other state-of-the-art techniques.

1 Introduction

Automatic speech recognition (ASR) is becoming an effective and versatile way to interact with modern machine interfaces. However, in order to successfully adopt ASR in any practical application, high robustness to non-stationary speaker and environmental factors is required. While many noise-robust ASR techniques have been shown to meet the demands of specific applications (e.g., mobile communication), they often fail in more complex scenarios such as in the presence of room reverberation.

Recently, conventional Gaussian mixture model (GMM) and hidden Markov model (HMM)-based ASR systems have been superseded by hybrid multilayer-perceptron (MLP)-HMM systems [1], often referred to as deep neural network (DNN) systems. Despite all the successes obtained with DNNs, attributed to their ability to learn from large amounts of potentially noisy data, investigations have shown DNN systems can be quite sensitive to mismatched environments. For instance in [2], it was shown that even with state-of-the-art DNN systems, front-end processing is helpful in increasing ASR performance in mismatched conditions.

Previous studies have attempted to counteract the convolutional distortion caused by reverberation using a number of denoising methods, such as frequency domain linear prediction [3], modulation filtered spectrograms [4], or missing-data mask estimation designed for dereverberation [5]. All of these approaches make weak assumptions about the reverberant data (e.g., they do not require that the room impulse response is known) but they achieve only a moderate increase in ASR performance. More recent techniques include MLP-based feature enhancement systems; for example, a deep recurrent neural network (RNN) approach for log-spectral domain feature enhancement was recently proposed in [6] and applied to dereverberation. Similarly, an RNN exploiting long-range temporal context by using memory cells in the hidden units was applied to dereverberation in [7]. A further example is the reverberation modeling for speech recognition (REMOS) framework [8], which combines a clean speech model with a reverberation model to determine clean speech and reverberation estimates during recognition via a modified Viterbi algorithm. In conditions with relatively long reverberation times, REMOS provides higher recognition accuracy than a matched model.

This article focuses on one of the most powerful approaches for denoising of recent years—non-negative matrix factorization (NMF)—which models the speech spectrogram as a sparse non-negative linear combination of dictionary elements (“speech atoms”). NMF was formulated in [9] to decompose multivariate data and has been the basis of several sound source separation [10, 11] and denoising [12] systems. Noise robust ASR systems based on NMF were introduced in [13], using either feature enhancement or hybrid HMM decoding with so-called sparse classification. An alternative formulation of NMF, non-negative matrix factor deconvolution (NMFD), was introduced in [14] to take better advantage of temporal information. NMFD lends itself naturally to dereverberation; [15, 16] describe methods for blind dereverberation by decomposing the reverberated spectrum into a clean spectrum convolved with a filter, while constraining the properties of the speech spectrum.

Our previous work, published in two papers in the REVERB’14 workshop [17, 18], described two dereverberation techniques that are combined and extended in the current study. In the first paper [17], a technique was described for speech dereverberation that draws on the fundamental idea of NMF, in that it models speech as a linear combination of dictionary elements. However, the NMF-based approach was extended to incorporate a filter in the Mel-spectral domain that could be optimized for arbitrary convolutions. Furthermore, [17] used missing-data mask imputed (MDI) [19, 20] spectrograms to produce the initial estimate of the sparse representation of the clean speech signal, giving more effective dereverberation. Our second REVERB’14 workshop paper proposed a distribution matching (DM) scheme for unsupervised dereverberation of speech features [18]. This utilizes stacked and decorrelated spectro-temporal vectors containing a long time context. In the decorrelated transformation domain, the distributions of reverberant supervectors are equalized to match the a priori clean speech distribution by applying a non-parametric histogram equalization-based approach [21].

Bringing the ideas in our two workshop papers together, the current paper proposes a novel dereverberation feature enhancement method in a noise-robust ASR framework by combining the NMF and DM methods — a combination that was not tested in either of the workshop papers. More specifically, we present a single-channel source separation technique which extracts the speech signal from the observed mixture of speech and noise signals and train the ASR back-end with the enhanced (dereverberated) features to increase the recognizer tolerance for artifacts generated in denoising. Our previous work [17, 18] shows that DM outperforms MDI as a feature enhancement strategy. This brings us to the goal of the present study: to investigate whether the performance advantage of DM translates into better initial estimates of the sparse representation of the dereverberated speech features, compared to that obtained with MDI. The proposed method is evaluated on the reverberant 2014 REVERB Challenge data set [22] and shown to provide equal or higher ASR performance than three existing state-of-the-art feature enhancement methods, using similar back-end processing provided by the Kaldi toolkit [23]. Among the methods compared against our new approach, we include the RNN-based feature enhancement, a feature enhancement based on blind reverberation time estimation, and our previous system which used MDI to produce the initial clean speech estimate.

The remainder of the paper is structured as follows. Section 2 gives an overview of the proposed two-stage feature enhancement process. Sections 3 and 4 define the DM- and MDI-based initializations, used to estimate the initial sparse representation of the clean speech signal for NMF. Section 5 describes the procedure for non-negative matrix factorization of reverberant speech. Section 6 gives an overview of the experimental setup including the data set, the ASR system, parameter optimization, and brief descriptions of an additional multichannel feature enhancement and the computational requirements of the two-stage feature enhancement. Our results are presented and discussed in Sections 7 and 8, and conclusions from the study are presented in Section 9.

2 Overview of the dereverberation process

The flowchart of the dereverberation process and the overall ASR system is shown in Fig. 1. First, the speech signal is pre-processed (denoted by 1. Pre-processing) into frames of Mel-scale filterbank energies, which are used as an input to the NMF part of the feature enhancement (denoted by 2. Feature enhancement) for dereverberation. Conventional NMF feature enhancement would then be initialized with the previously described reverberant speech, but our implementation divides the feature enhancement into two stages: in the first stage, we construct an initial estimate of the non-reverberant speech that is used to initialize the NMF algorithm in the second stage. The factorization algorithm is initialized either with DM (described in Section 3) or MDI (briefly described in Section 4) dereverberated speech. The ASR back-end (denoted by 3. Back-end in the figure) consists of either GMM- or DNN-based acoustic modeling of enhanced and transformed features and an HMM-based decoder.

3 Distribution matching initialization

The goal of the distribution matching (DM) method [18] is to recover the clean speech spectra x from the observed reverberant speech spectra y when the clean speech prior distribution p(x) is assumed to be known and the distribution of the observed reverberant speech p(y) can be estimated during recognition. As our goal is to counteract the effects of reverberation, it is important to take into account the long time span in the effects of reverberation. In the following, we develop a method to map the distribution of reverberant speech observations to the clean speech prior. The DM method is also illustrated in Fig. 2. The method uses long time contexts decorrelated by linear transformations, after which a histogram equalization (HEQ) mapping can be utilized using one-dimensional distribution samples. HEQ was originally proposed for image processing [24] but subsequently has also been utilized in ASR to counteract noise and speaker changes over short temporal windows [21]. With a longer temporal context, as in the present study, HEQ has been used for feature space Gaussianization [25] to obtain a feature space that is easier to model with GMMs.

DM utilizes three steps that are applied in two iterations. The first step of the method is to find a signal representation that has a sufficiently long time context to counteract the effects of reverberation. Assuming that the effects of reverberation are linear and convolutive with the speech signal, we can represent them in the feature domain as linear transformations. Our basis features are C-dimensional Mel-spectral feature vectors of observed speech y that have been normalized to compensate for spectral distortion. The normalization is performed by estimating the reverberation-free spectral peaks to compute the normalization coefficients [5]. In the first iteration round, the observed speech y corresponds to original reverberated speech, whereas in the second iteration round, we use the dereverberated estimate as the observation $\boldsymbol {\mathsf {y}}=\hat {\boldsymbol {\mathsf {x}}}$. To take into account the duration of reverberation, the Mel-spectral observations are stacked over T consecutive frames to form CT-dimensional supervectors

$$ \boldsymbol{y}_{t}=\left[\boldsymbol{\mathsf{y}}_{t}^{\top}\ \dots\ \boldsymbol{\mathsf{y}}_{t+T-1}^{\top}\right]^{\top}, $$

(1)

where T is chosen large enough compared to the duration of the room impulse responses (RIRs) and ⊤ indicates transpose. Consequently, the speech features y affected by convolution can be formulated as y≈H x, where H is a filter matrix that performs convolution on the supervector x constructed from clean speech features.
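As a minimal sketch of the stacking in Eq. (1), the following pure-Python snippet builds CT-dimensional supervectors from a list of C-dimensional frames. The function name `stack_frames` and the toy data are illustrative assumptions rather than part of the original method description. The sketch stops at the last full window, since the text does not say how utterance boundaries are handled.

```python
def stack_frames(frames, T):
    """Stack T consecutive C-dimensional frames into CT-dimensional supervectors.

    frames: list of N frames, each a list of C Mel-spectral values.
    Returns N - T + 1 supervectors; supervector t holds frames t .. t+T-1.
    No boundary padding is applied (an assumption of this sketch).
    """
    if T < 1 or T > len(frames):
        raise ValueError("T must satisfy 1 <= T <= number of frames")
    supervectors = []
    for t in range(len(frames) - T + 1):
        stacked = []
        for frame in frames[t:t + T]:
            stacked.extend(frame)  # concatenate the frames of this window
        supervectors.append(stacked)
    return supervectors


# Toy input: N = 4 frames with C = 2 channels each.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
supervectors = stack_frames(frames, T=3)
```

With T = 3, the four toy frames yield two six-dimensional supervectors that overlap in two frames.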

The second step is to find a transformed feature domain that allows the use of one-dimensional mapping functions from the observed feature distribution to the non-reverberant target distribution. The supervector-based feature vectors x and y are highly correlated along the feature dimension because each vector includes spectral and temporal context, which introduces problems. In order to map such highly correlated features from the observed to the non-reverberant distribution, a complex multivariate mapping would be needed. However, the problem can be simplified by applying a decorrelating linear transformation to the spectral-temporal supervectors, after which it is possible to perform one-dimensional mappings. In this study, the applied transformation D is based on principal component analysis (PCA) to decorrelate the elements of the speech feature supervectors on a log-scale,

$$ \boldsymbol{g_{y}'}=\boldsymbol{D} \log \boldsymbol{y} \approx \boldsymbol{D} \log \left(\boldsymbol{H} \boldsymbol{x}\right), $$

(2)

where y corresponds to reverberant speech in the first iteration and to the dereverberated speech estimate $\boldsymbol{y}=\hat{\boldsymbol{x}}$ in the second iteration. The quantity $\boldsymbol{g_{y}'}$ denotes the observed speech supervector features in the decorrelated feature space, and the log operation is computed elementwise. The number of retained low-order principal components M of D can be treated as a tunable free parameter to obtain a more or less smoothed representation.

The third step is to develop the one-dimensional mapping functions that can be applied elementwise in the decorrelated feature domain. First, we make an assumption that the transformation D that decorrelates the non-reverberant speech supervectors x in the estimation of clean speech prior distribution also decorrelates all the observed speech supervectors y regardless of the extent of reverberation. Then, we can formulate one-dimensional elementwise bijective (one-to-one) mappings $F_{yx}^{(m)}$ from PCA-transformed reverberant supervector elements $g_{y}'(m)$ to dereverberated ones $\tilde{g}_{x}'(m)$ as follows

$$ \tilde{g}_{x}'(m) = F_{yx}^{(m)}\left(g_{y}'(m)\right), $$

(3)

where m indexes the mapping for each feature element. As the PCA-transformed supervectors $\boldsymbol{g_{y}'}$ represent sufficient temporal context relative to reverberation effects, it is possible to find effective mappings from reverberant speech to clean speech (see [18]).

In this work, the functions $F_{yx}^{(m)}$ are obtained by mapping the distribution of observed speech to match the distribution of the clean speech prior. In the first iteration step, we use the original reverberant speech as observations, and in the second step, we use the dereverberated estimate from the first iteration round. The mapping is easy to find if the distributions of clean and observed speech are represented by inverse cumulative distribution functions (ICDFs) [21, 25]. In general, the empirical ICDF $\Phi_{y}^{-1}$ can be obtained simply by scaling and sorting the data samples. In our case, however, we omit the scaling as the data has already been equalized for spectral deviation. From now on, we simplify the notation and operate on individual components of the decorrelated supervectors by dropping all indices m. The mapping function $F_{yx}$ from reverberant speech ICDF $\Phi_{y}^{-1}$ to clean speech ICDF $\Phi_{x}^{-1}$ is implemented by constructing a lookup table $\Phi_{y}^{-1} \xrightarrow[F]{} \Phi_{x}^{-1}$ with piecewise cubic Hermite interpolation (Section 3.3 in [26]). When applied in practice, the lookup table needs to be updated to reflect the current reverberation condition encountered during recognition. Assuming that reverberation conditions change slowly, a sample of reverberant data is collected during recognition to model the reverberant distribution $\Phi_{y}^{-1}$, which is the mapping input data distribution. While the input data distribution needs updating, the mapping target distribution $\Phi_{x}^{-1}$ is always represented using the same static clean speech sample. In the present study, the mapping input distribution is updated during recognition passes by using batches of development or test-set data. Each batch corresponds to a static reverberation condition in the REVERB Challenge data, described in Section 6.1.
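The lookup-table idea can be sketched for a single decorrelated component as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper name `make_heq_mapping` and the grid of 101 quantile levels are hypothetical, and plain linear interpolation stands in for the piecewise cubic Hermite interpolation used in the paper so that only the Python standard library is needed.

```python
from bisect import bisect_right


def make_heq_mapping(observed, clean, n_levels=101):
    """Build a one-dimensional mapping from the observed to the clean distribution.

    Both empirical ICDFs are sampled at the same n_levels quantile levels
    (an assumed grid size), which gives the lookup table Phi_y^-1 -> Phi_x^-1.
    """
    def quantiles(samples):
        s = sorted(samples)  # sorting yields the empirical ICDF
        out = []
        for k in range(n_levels):
            pos = k / (n_levels - 1) * (len(s) - 1)
            lo = int(pos)
            hi = min(lo + 1, len(s) - 1)
            out.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
        return out

    qy, qx = quantiles(observed), quantiles(clean)

    def mapping(value):
        # Clamp outside the observed range, interpolate linearly inside it.
        # (Linear interpolation replaces the cubic Hermite scheme here.)
        if value <= qy[0]:
            return qx[0]
        if value >= qy[-1]:
            return qx[-1]
        i = bisect_right(qy, value) - 1
        frac = (value - qy[i]) / (qy[i + 1] - qy[i])
        return qx[i] + frac * (qx[i + 1] - qx[i])

    return mapping


# Toy data: the "observed" samples are an affine distortion of the clean ones.
clean = [float(c) for c in range(101)]
observed = [2.0 * c + 1.0 for c in clean]
F = make_heq_mapping(observed, clean)
restored = F(41.0)  # 41 = 2 * 20 + 1, so this should land close to 20
```

Because the mapping only aligns quantiles, it undoes any monotonic distortion of a single component. This is why the decorrelation step is needed before it can be applied elementwise.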

We can now produce the estimate of the dereverberated log-spectral supervector $\tilde {\boldsymbol {x}}'$ as

$$ \tilde{\boldsymbol{x}}' = \boldsymbol{D}^{-1}F_{yx}\left(\boldsymbol{g_{y}'}\right), $$

(4)

where the mapping $F_{yx}$ is realized using separate lookup tables $F_{yx}^{(m)}$ for each element m of $\boldsymbol{g_{y}'}$ and $\boldsymbol{D}^{-1}$ is the inverse PCA transformation. Then, supervectors $\tilde {\boldsymbol {x}}'$ are unstacked to the linear Mel-spectral domain $\tilde {\boldsymbol {\mathsf {x}}}$ with one frame time context using overlap adding, so that regions in adjacent supervectors containing Mel-spectra of the same time frame are averaged. Thus, linear Mel-spectral vectors $\tilde {\boldsymbol {\mathsf {x}}}$ are obtained as

$$ \tilde{\boldsymbol{\mathsf{x}}} = \exp\left(1/T\sum_{t=1}^{T} \tilde{\boldsymbol{x}}'_{T-t}(\psi)\right), $$

(5)

where t indexes both the frames of adjacent supervectors and also the component Mel-spectral vectors within the range ψ=[(t−1)C+1,…,t C] in each supervector.
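The unstacking of Eq. (5) can be illustrated with a short sketch. The function name `unstack_overlap_add` is hypothetical. As an assumption of this sketch, frames near the utterance boundaries, which are covered by fewer than T supervectors, are divided by the number of supervectors that actually cover them rather than by T.

```python
import math


def unstack_overlap_add(log_supervectors, T, C):
    """Average overlapping log-spectral segments and return linear-domain frames."""
    n_frames = len(log_supervectors) + T - 1
    sums = [[0.0] * C for _ in range(n_frames)]
    counts = [0] * n_frames
    for t, sv in enumerate(log_supervectors):
        for k in range(T):
            segment = sv[k * C:(k + 1) * C]  # k-th frame inside supervector t
            for c in range(C):
                sums[t + k][c] += segment[c]
            counts[t + k] += 1
    # Average in the log domain, then exponentiate back to linear Mel-spectra.
    return [[math.exp(v / counts[n]) for v in sums[n]] for n in range(n_frames)]


# Round trip on toy data: stack log-frames, then unstack them again.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
T, C = 3, 2
log_sv = [[math.log(v) for f in frames[t:t + T] for v in f]
          for t in range(len(frames) - T + 1)]
recovered = unstack_overlap_add(log_sv, T, C)
```

When the supervectors are consistent with each other, as in this round trip, the averaging returns the original frames. After the PCA truncation and the distribution mapping they generally disagree, and the averaging acts as a smoother.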

However, the dereverberated feature estimates $\tilde {\boldsymbol {\mathsf {x}}}$ in this form are smoothed by the PCA and averaging operators. Therefore, we apply a Wiener filter to reintroduce some short-term variation that was present in the original reverberant observations y but was removed by the smoothing. For the Wiener filter, we also need a version of the reverberated features $\tilde {\boldsymbol {\mathsf {y}}}$ that were smoothed by the same PCA transformation D. The Wiener-filtered feature estimate $\hat {\boldsymbol {\mathsf {x}}}$ is given by

$$ \hat{\boldsymbol{\mathsf{x}}} = \tilde{\boldsymbol{\mathsf{x}}} \ ./ \ \tilde{\boldsymbol{\mathsf{y}}} \ .\!* \ \boldsymbol{\mathsf{y}}, $$

(6)

where ./ denotes elementwise division and .∗ elementwise multiplication. The importance of Wiener filtering is demonstrated in our previous work [18].
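Eq. (6) amounts to a time-varying gain applied to the original observation, as the following sketch shows. The function name `wiener_filter` and the small `floor` that guards against division by zero are assumptions added for illustration; the paper does not describe such a guard.

```python
def wiener_filter(x_smooth, y_smooth, y_obs, floor=1e-10):
    """Apply the elementwise gain x_smooth ./ y_smooth to the observation y_obs.

    All arguments are lists of frames (lists of Mel-spectral values).
    floor is a hypothetical safeguard against division by zero.
    """
    return [[xs / max(ys, floor) * yo for xs, ys, yo in zip(x_row, y_row, o_row)]
            for x_row, y_row, o_row in zip(x_smooth, y_smooth, y_obs)]


# One frame with two channels: the gains are 0.5 in both channels.
x_hat = wiener_filter([[1.0, 2.0]], [[2.0, 4.0]], [[3.0, 2.0]])
```

In this toy case the result is `[[1.5, 1.0]]`: the short-term detail of the observation is retained, scaled by the ratio of the two smoothed estimates.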

After progressing through the above three steps (Eqs. (1)–(6)) in the first iteration, the reverberant observation y is substituted with the current estimate $\hat {\boldsymbol {\mathsf {x}}}$. After the second iteration, we obtain the estimates $\hat {\boldsymbol {\mathsf {x}}}$ that are used either directly as enhanced features or as initialization estimates for the NMF processing.

4 Missing data imputation initialization

The missing data imputation method used here utilizes bounded conditional mean imputation (BCMI) as proposed in [19]. The method uses a GMM to capture the clean speech statistics for reconstructing the unreliable noisy regions of the observed speech spectrum. Here, we denote the noise-free reliable part of the speech spectrum by $\mathbf{x}_{\mathrm{r}}$ and the noisy unreliable part by $\mathbf{x}_{\mathrm{u}}$. The BCMI produces the clean speech estimate $\hat {\mathbf {x}}_{\mathrm {u}}$ using the conditional distribution $p(\mathbf{x}_{\mathrm{u}} \mid \mathbf{x}_{\mathrm{r}})$ with the assumption that the observed noisy speech $\mathbf{x}_{\mathrm{u}}$ acts as the upper bound for the underlying clean speech.

For estimating the missing data mask that specifies the reliable and unreliable regions, we use the approach proposed in [27]. The method uses a modulation band-pass filter along the time trajectory, tuned to the speech syllable rate. The filter emphasizes reverberation-free speech onsets so that they can be distinguished from reverberant segments of speech. Regions which are emphasized by the filter are labeled reliable, while regions that are de-emphasized are labeled as unreliable.

5 Non-negative matrix factorization of reverberant speech

Methods based on the non-negative matrix factorization (NMF) framework have been widely used for various speech processing tasks. A typical application of NMF is noisy speech feature enhancement via supervised source separation [13]. Given a pre-set dictionary of fixed-size magnitude spectrogram atoms of both speech and noise, an observed spectrogram is modeled by their non-negative linear combination. The individual reconstructions of both clean speech and noise spectrograms are based on estimating the corresponding dictionary atoms and their coefficients in the NMF representation. To account for observations of arbitrary length, the processing can be performed in (overlapping) windows of a fixed length of T frames.

In this work, we consider reverberant but relatively noise-free speech. Hence, we do not make use of the noise dictionary but still build on the same underlying speech model. We denote by Y the observed speech, represented by a $TC \times N$ matrix. Each column of Y is a collection of T frames of a C-dimensional Mel-scale spectrogram, stacked into a single vector. Under the NMF model, we have the approximation

$$ \mathbf{Y} \approx \mathbf{S}\mathbf{A}, $$

(7)

where S is a $TC \times K$ dictionary matrix of K spectrograms of clean speech, while A is the $K \times N$ activation matrix holding the linear combination coefficients.

The effect of reverberation extends across frame boundaries in the Mel-spectrogram domain. This can be approximated by a convolution of the samples of each frequency channel with a channel-specific $T_{f}$-sample filter. Using the stacked vector representation of the T-frame windows, the model of Eq. (7) can be extended to perform this convolution within each window. The resulting approximation is

$$ \mathbf{Y} \approx \mathbf{R} \mathbf{S} \mathbf{A}, $$

(8)

where, denoting $T_{r}=T+T_{f}-1$, R is a $T_{r}C \times TC$ matrix of the form

$$ \mathbf{R} = \left(\begin{array}{ccc} \begin{array}{ccc} r_{1,1} & & \\ & \ddots & \\ & & r_{1,C} \end{array} & \mathbf{0} & \cdots \\ \begin{array}{ccc} r_{2,1} & & \\ & \ddots & \\ & & r_{2,C} \end{array} & \begin{array}{ccc} r_{1,1} & & \\ & \ddots & \\ & & r_{1,C} \end{array} & \cdots \\ \vdots & \vdots & \ddots \end{array}\right). $$

(9)

The diagonal structure of R is designed so that a left multiplication of a stacked window vector of T frames results in the discrete convolution of the filter $\left [ r_{1,c} r_{2,c} \cdots r_{T_{f},c} \right ]$ and the samples of the frequency channel c in that window. It is worth noting that Eq. (8) can be interpreted as either reverberating the clean speech estimate, R(S A), or making a linear combination of reverberated speech atoms, (R S)A.
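To make the structure of R concrete, the sketch below builds the matrix from a table of filter coefficients and applies it to one stacked window. The helper names `build_filter_matrix` and `matvec` and the toy coefficients are illustrative assumptions. Indices are zero-based here, so `r[k][c]` corresponds to $r_{k+1,c+1}$ in the text.

```python
def build_filter_matrix(r, T):
    """Build the (T_r*C) x (T*C) matrix R from filter taps r[k][c].

    r holds T_f taps for each of C channels; T is the window length in frames.
    """
    T_f, C = len(r), len(r[0])
    T_r = T + T_f - 1
    R = [[0.0] * (T * C) for _ in range(T_r * C)]
    for t in range(T):            # input frame index
        for k in range(T_f):      # filter tap index
            for c in range(C):    # frequency channel
                # Tap k of channel c maps input frame t to output frame t + k.
                R[(t + k) * C + c][t * C + c] = r[k][c]
    return R


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Toy case: C = 2 channels, T = 3 frames, T_f = 2 taps per channel.
r = [[1.0, 0.5],
     [0.5, 0.25]]
window = [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]  # stacked frames (1,10), (2,20), (3,30)
reverberated = matvec(build_filter_matrix(r, T=3), window)
```

Channel 0 of the result is the convolution of [1, 2, 3] with [1, 0.5], and channel 1 is the convolution of [10, 20, 30] with [0.5, 0.25]. The two channels are interleaved frame by frame over $T_{r}=4$ output frames.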

5.1 Optimization of the filter and activation matrices

Following the supervised NMF model, the dictionary matrix S is held constant. In the sliding window model, the values of the filter and activation matrices R and A are obtained independently for each window t. Denoting by $\mathbf{Y}_{t}$ and $\mathbf{A}_{t}$ the corresponding columns of Y and A, the filter and activation matrices are set to minimize

$$ \sum_{t} \left(d(\mathbf{Y}_{t}, \mathbf{R} \mathbf{S} \mathbf{A}_{t}) + \lambda \left\| \mathbf{A}_{t} \right\|_{1} \right), $$

(10)

where the $d(\mathbf{Y}_{t}, \mathbf{R}\mathbf{S}\mathbf{A}_{t})$ term is a distance measure between the observation and the NMF approximation. The second term, which consists of the $L^{1}$ norm $\|\cdot\|_{1}$ of the activation weights multiplied by the sparsity coefficient λ, is intended to induce sparsity in A and thereby yield a sparse representation of the observation. In this work, the generalized Kullback-Leibler divergence is used for d.
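For one window, the cost of Eq. (10) can be evaluated as in the sketch below, which uses the standard definition of the generalized Kullback-Leibler divergence, $d(y,z)=\sum_{i} \left(y_{i} \log (y_{i}/z_{i}) - y_{i} + z_{i}\right)$, with the convention $0 \log 0 = 0$. The function names are hypothetical, and the sketch assumes that all entries of the approximation are strictly positive.

```python
import math


def generalized_kl(y, z):
    """Generalized KL divergence between non-negative vectors y and z (z > 0)."""
    total = 0.0
    for yi, zi in zip(y, z):
        if yi > 0:                      # convention: 0 * log(0) = 0
            total += yi * math.log(yi / zi)
        total += zi - yi
    return total


def window_cost(y_t, approx_t, a_t, lam):
    """Cost of one window: divergence plus lam times the L1 norm of the activations."""
    return generalized_kl(y_t, approx_t) + lam * sum(abs(a) for a in a_t)


# A perfect reconstruction leaves only the sparsity penalty: 0.1 * (0.5 + 0.0 + 1.5).
cost = window_cost([1.0, 2.0], [1.0, 2.0], [0.5, 0.0, 1.5], lam=0.1)
mismatch = generalized_kl([1.0], [2.0])  # positive for any imperfect approximation
```

The full objective is the sum of `window_cost` over all windows t.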

The form of Eq. (10) admits the use of conventional iterative NMF optimization algorithms [9, 13] to perform multiplicative updates to both the R and A matrices. However, the optimization problem is not convex, and a simple scheme of alternately updating R and A did not yield results useful for dereverberation in earlier experiments [17]. The reasons behind this are hypothesized in Section 8. Accordingly, we use the following series of steps to obtain the factorization R S A:

1. A simpler dereverberation method is used to obtain an initial estimate of the non-reverberant speech of the observation, denoted by $\bar {\mathbf {X}}$. In this work, the estimate is obtained either through DM or MDI initialization, described in Sections 3 and 4, respectively.
2. The activation matrix A is initialized to all ones and iteratively updated for $I_{1}$ rounds to perform the factorization $\bar {\mathbf {X}} \approx \mathbf {S}\mathbf {A}$.
3. While the dictionary atoms of S are strictly clean speech, the initial estimate $\bar {\mathbf {X}}$ is never perfectly dereverberated. Consequently, the activations A resulting from the preceding step will reflect the effects of reverberation, typically characterized by sequences of consecutive non-zero activations of the same dictionary atom. We therefore filter the time sequences of activations for each atom using a filter $H_{A}(z)$ and clamp the result to be non-negative. This filtering step has the effect of biasing the following estimation of R to emphasize the reverberation.
4. The filter matrix R is initialized to hold the constant $T_{f}$-sample filter $\frac {1}{T_{f}} \left [1 \cdots 1 \right ]$ for each frequency band. While keeping the A matrix fixed, R is iteratively updated for $I_{2}$ rounds to minimize the cost in the approximation $\mathbf{Y} \approx \mathbf{R}\mathbf{S}\mathbf{A}$. However, the multiplicative updates are neither guaranteed to preserve the filter structure described in Eq. (9), except for the zero elements, nor to result in a realizable filter. To enforce these properties, R is processed to have the form of Eq. (9) after each iteration: the new values of the filter coefficients $r_{t,c}$ are obtained by averaging over all their occurrences in the updated R, and clamping large values to satisfy $\forall t: r_{t+1,c} \leq r_{t,c}$. The coefficients are also uniformly scaled to $\sum _{t,c} r_{t,c} = C$.
5. As a final step, the R matrix is kept fixed, and the A matrix is iteratively updated for $I_{3}$ rounds based on $\mathbf{Y} \approx \mathbf{R}\mathbf{S}\mathbf{A}$.
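The constraints imposed on the filter coefficients in step 4 can be sketched as follows. This is one possible reading rather than the authors' code. The function name `constrain_filter` is hypothetical, the clamping is implemented as a running minimum along time (an assumption about how the inequality is enforced), and the averaging over repeated occurrences in R is omitted because the sketch starts from an already extracted coefficient table `r[t][c]`.

```python
def constrain_filter(r):
    """Make each channel's filter non-increasing in time and scale the sum to C.

    r[t][c] is tap t of channel c, with T_f taps and C channels.
    """
    T_f, C = len(r), len(r[0])
    out = [row[:] for row in r]
    for c in range(C):
        for t in range(1, T_f):
            # Clamp large values so that r[t][c] <= r[t-1][c] (running minimum).
            out[t][c] = min(out[t][c], out[t - 1][c])
    total = sum(sum(row) for row in out)
    scale = C / total  # uniform scaling so that all coefficients sum to C
    return [[v * scale for v in row] for row in out]


# Toy table with T_f = 3 taps and C = 2 channels; two entries violate monotonicity.
raw = [[0.5, 0.4],
       [0.7, 0.2],
       [0.3, 0.4]]
constrained = constrain_filter(raw)

# The uniform initialization of step 4 already satisfies both constraints.
uniform = [[1.0 / 3.0] * 2 for _ in range(3)]
unchanged = constrain_filter(uniform)
```

The non-increasing constraint matches the expectation that reflected energy decays over time, and the normalization fixes the overall gain of the filter so that scale is carried by the activations instead.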

To demonstrate the behavior of the algorithm described above, Fig. 3 illustrates the cost function of Eq. (10) as a function of the update iterations. All three iterative stages of the algorithm are shown: $I_{1}=50$ iterations of updating activations A based on the initial estimate $\bar {\mathbf {X}}$ in step 2, $I_{2}=50$ iterations of updating the filter matrix R in step 4, and finally $I_{3}=75$ further iterations to obtain the final values of A in step 5. The activation filtering in step 3 is reflected by a discontinuity in the cost function between steps 2 and 4. Note that the plotted cost function is based on the reverberant observation Y, which is not directly used as the optimization target in step 2. The cost function also measures only the accuracy of the reconstruction $\mathbf{R}\mathbf{S}\mathbf{A}$ and the sparsity of A and therefore does not indicate the dereverberation strength, which depends primarily on the filter represented by R.

A major drawback of this simple sliding window scheme in reverberant conditions occurs when the start of a window coincides with a silent interval in the underlying speech signal. In this case, the early frames of the window are dominated by observed reflections. When such a window is represented using a dictionary of individually reverberated atoms, the energy in the early frames is interpreted as direct sound and not properly attenuated.

To alleviate this issue, we use the NMFD [14] model, so that an individually reverberated dictionary atom activated in one window can “explain away” the energy of its reflections in succeeding overlapping windows. For the stacked vector representation, a computationally efficient implementation of the NMFD optimization scheme can be formulated by modifying the multiplicative update rule for the activation matrix A used in the iterative steps 2, 4, and 5 of the above algorithm.

For conventional NMF processing, the multiplicative update of matrix A that corresponds to the cost function given in Eq. (10) is defined as [13]

$$ \mathbf{A} \leftarrow \mathbf{A} \! .\!* \frac{(\mathbf{R}\mathbf{S})^{\top} \frac{\mathbf{Y}}{\mathbf{R}\mathbf{S}\mathbf{A}}}{(\mathbf{R}\mathbf{S})^{\top} \mathbf{1} + \lambda}, $$

(11)

where .∗ denotes elementwise multiplication, the division of two matrix operands is likewise performed elementwise, and 1 is a $T_{r}C \times N$ all-one matrix. We introduce the dependencies between consecutive windows by adjusting the $\frac {\mathbf {Y}}{\mathbf {R}\mathbf {S}\mathbf {A}}$ term, so that the new update rule is

$$ \mathbf{A} \leftarrow \mathbf{A} \! . \!* \frac{(\mathbf{R}\mathbf{S})^{\top} s\left(\frac{\mathbf{y}}{o(\mathbf{R}\mathbf{S}\mathbf{A})} \right)}{(\mathbf{R}\mathbf{S})^{\top} \mathbf{1} + \lambda}, $$

(12)

where y is the original, non-stacked observation spectrogram. In the update rule, the o(Z) function denotes the result of overlap-adding the stacked vectors of matrix Z to a single spectrogram (in the same way as in Eq. (5)), while the s(z) function denotes the conversion of spectrogram z to the stacked form. The corresponding change is also made to the update rule of the R matrix,

$$ \mathbf{R} \leftarrow \mathbf{R} \!.\!* \frac{s\left(\frac{\mathbf{y}}{o(\mathbf{R}\mathbf{S}\mathbf{A})} \right)(\mathbf{S}\mathbf{A})^{\top}}{\mathbf{1} (\mathbf{S}\mathbf{A})^{\top}}. $$

(13)
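For reference, the conventional activation update of Eq. (11) can be sketched in a few lines. The sketch covers only the non-overlapping form, not the NMFD variant of Eq. (12). Here `W` stands for the product RS, the helper names are hypothetical, and the small `eps` guarding the division is an assumption added for numerical safety.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(M):
    return [list(col) for col in zip(*M)]


def update_activations(W, Y, A, lam, eps=1e-12):
    """One multiplicative update A <- A .* (W^T (Y ./ (W A))) ./ (W^T 1 + lam)."""
    WA = matmul(W, A)
    ratio = [[y / max(z, eps) for y, z in zip(y_row, z_row)]
             for y_row, z_row in zip(Y, WA)]
    Wt = transpose(W)
    numer = matmul(Wt, ratio)
    # (W^T 1) has identical columns: each entry is the sum of one dictionary atom.
    atom_sums = [sum(row) for row in Wt]
    return [[a * n / (s + lam) for a, n in zip(a_row, n_row)]
            for a_row, n_row, s in zip(A, numer, atom_sums)]


# Toy problem with an identity dictionary, one window, and all-one initial activations.
W = [[1.0, 0.0], [0.0, 1.0]]
Y = [[2.0], [3.0]]
A0 = [[1.0], [1.0]]
A_plain = update_activations(W, Y, A0, lam=0.0)   # reaches the exact solution
A_sparse = update_activations(W, Y, A0, lam=1.0)  # penalty shrinks the activations
```

With the identity dictionary, a single update without the penalty recovers the observation exactly, while λ = 1 halves both activations. Since the update is multiplicative, non-negative activations remain non-negative and zeros remain zero.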

5.2 NMF-based feature enhancement of reverberant speech

Based on the factorization, we can directly reconstruct the reverberant observation as $\tilde {\mathbf {Y}} = \mathbf {R}\mathbf {S}\mathbf {A}$ and the underlying clean speech as $\tilde {\mathbf {X}} = \mathbf {S}\mathbf {A}$. By overlap-adding the stacked vectors, we obtain the corresponding Mel-scale spectrogram estimates $\tilde {\mathbf {y}}$ and $\tilde {\mathbf {x}}$. While $\tilde {\mathbf {x}}$ could be used directly as input for a speech recognition system, in existing work on NMF-based source separation for speech in additive noise [13], better performance was obtained by using the same Wiener-filtering approach we have described for the DM-based initialization. Therefore, we compute the final enhanced features, as in the DM method, by filtering the original observation with the time-varying Mel-spectral filter defined as $\tilde {\mathbf {x}} \ ./ \tilde {\mathbf {y}}$, where ./ denotes elementwise division.

The full NMF-based feature enhancement algorithm is provided in pseudo-code form in Algorithm 1.

6 Experimental setup

6.1 Data set

The proposed feature enhancement method presented in the paper is evaluated on the 2014 REVERB Challenge data set [22]. The data set is only briefly described here. The first part of the data set, denoted by SimData, consists of an artificially reverberated British English version of the 5000-word Wall Street Journal corpus [28] mixed with recordings of background noise at a fixed signal-to-noise ratio (SNR) of 20 dB. SimData contains far and near microphone positions in three rooms of different size for a total of six recording scenarios. The second part of the REVERB Challenge data set contains real recordings, denoted by RealData, extracted from the multichannel Wall Street Journal audio visual corpus. The utterances of RealData have been recorded in a reverberant office space with background noise originating mostly from the air conditioning [29]. A summary of the SimData and RealData recording conditions is presented in the upper part of Table 1.

Table 1 Summary of recording conditions and data set parameters. SimData denotes artificially reverberated speech data with real RIRs and RealData denotes true recordings made in a reverberant room

The data set is divided into speaker-exclusive training (clean speech), development, and evaluation sets. The RIRs are also different in the development and evaluation sets. The durations and the numbers of speakers and utterances of the sets are shown in the lower part of Table 1. In addition to the clean speech training set, an equal-sized multicondition (MC) training set is provided. The MC training data is artificially corrupted in the same manner as SimData but with unique impulse responses.

All the reverberant utterances in the REVERB Challenge data set are provided as single-channel, 2-channel, and 8-channel recordings. However, experiments in this study use either the single-channel setup, which is the main part of the study, or the 8-channel system in an additional experiment. The 8-channel system is constructed by applying a frequency domain delay-and-sum (DS) beamformer prior to the feature enhancement to investigate whether multichannel setups gain from the proposed method. The DS beamforming is briefly described in Section 6.4.

6.2 ASR system

In total, six feature enhancement, or front-end processing, combinations are applied in the evaluation. Four of them operate on single-channel audio: DM alone, NMF alone, DM-initialized NMF (denoted by DM+NMF), and MDI-initialized NMF (denoted by MDI+NMF). In the remaining two, the DM+NMF and MDI+NMF enhancements are combined with the additional DS beamformer in order to recognize the 8-channel audio. All systems with feature enhancements are trained on the MC training set.

The ASR back-end processing is performed using the publicly available Kaldi recognition toolkit [23] and the system utilized here is based on REVERB scripts provided in the toolkit. The use of Kaldi allows us to obtain results that are competitive with the state-of-the-art and also allows direct comparison with other studies that are based on the Kaldi back-end such as [6, 7, 30].

Two hybrid DNN-HMM and four GMM-HMM back-end systems of increasing acoustic model complexity are trained. The first back-end system, denoted by LDA+MLLT, is a triphone-based recognizer which uses feature vectors constructed from the first 13 of 23 Mel-frequency cepstral coefficients (MFCCs) drawn from nine consecutive frames. The feature vector dimensionality is reduced to 40 by linear discriminant analysis (LDA). Furthermore, a maximum likelihood linear transform (MLLT) is applied to improve the separability of acoustic classes in the feature space. The LDA+MLLT system is trained with the MC training set, but a similar system is also trained with the clean speech data for reference.
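The construction of the spliced input to LDA can be sketched as follows. This is only an illustration of stacking nine consecutive 13-dimensional frames into a 117-dimensional vector, not the Kaldi implementation; the function name `splice_frames` and the edge-padding strategy (repeating the first and last frames) are assumptions.

```python
import numpy as np

def splice_frames(feats, context=4):
    """Stack each frame with `context` neighbours on both sides.

    feats: array of shape (n_frames, n_coeffs), e.g. 13 MFCCs per frame.
    Returns an array of shape (n_frames, (2 * context + 1) * n_coeffs).
    Edge frames are padded by repeating the first/last frame (assumption).
    """
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    n_frames = feats.shape[0]
    # Block i holds the frame at offset (i - context) relative to the centre.
    blocks = [padded[i:i + n_frames] for i in range(2 * context + 1)]
    return np.hstack(blocks)
```

With `context=4` and 13 coefficients per frame, each row has 9 × 13 = 117 entries, which an LDA projection then reduces to 40 dimensions as described above.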

The second back-end system, denoted by LDA+MLLT+SAT, supplements the LDA+MLLT system with utterance-based speaker adaptive training (SAT). This is based on a variant of feature domain-constrained maximum likelihood linear regression (fMLLR) [31] designed for rapid adaptation on very small amounts of adaptation data.

The third back-end system, denoted by LDA+MLLT+SAT+f-bMMI, uses the acoustic model of the LDA+MLLT+SAT back-end to execute feature-space boosted maximum mutual information (f-bMMI)-based discriminative training [32]. The LDA+MLLT+SAT+f-bMMI system is trained to obtain fully comparable single- and 8-channel results with the feature enhancement proposed in [30] and comparable 8-channel results with [6]. In the experiments, we set the boost factor to 0.1.

The fourth back-end system, denoted by LDA+MLLT+SA+bMMI+MBR, is based on the LDA+MLLT system and supplements it with utterance-based fMLLR speaker adaptation, boosted MMI (bMMI), and minimum Bayes risk (MBR) decoding [33]. The LDA+MLLT+SA+bMMI+MBR system is trained to obtain fully comparable results with the feature enhancement proposed in [6].

The fifth back-end is a hybrid DNN-HMM system, denoted by LDA+MLLT+SAT+DNN, trained with the adapted features of the LDA+MLLT+SAT back-end using a frame-based cross-entropy criterion and p-norm nonlinearities [34]. The DNNs consisted of 4 hidden layers and approximately 6.3 million parameters. The sixth back-end, denoted by LDA+MLLT+SAT+DNN+SMBR, supplements the LDA+MLLT+SAT+DNN back-end with state-level minimum Bayes risk (SMBR) criterion-based discriminative training [35] to obtain comparable results with the feature enhancement proposed in [30]. The SMBR training is applied only to the best performing LDA+MLLT+SAT+DNN back-end in the development set.

For the language model (LM), we use the 5000-word trigram model provided in the WSJ corpus. The LM weights are optimized separately for each back-end and for each feature enhancement combination, based on the averaged recognition word error rate (WER) over all eight test conditions in the development set. The optimized LM weights are also used in the estimation of fMLLR transformations for the first-pass recognition hypotheses.
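The LM weight selection amounts to picking the weight with the lowest WER averaged over the development-set conditions. A minimal sketch, assuming the per-condition WERs for each candidate weight have already been collected in a dictionary (the function name and data layout are hypothetical):

```python
def select_lm_weight(wer_by_weight):
    """Return the LM weight with the lowest mean WER.

    wer_by_weight: dict mapping each candidate LM weight to a list of
    per-condition WERs measured on the development set (hypothetical layout).
    """
    def mean_wer(weight):
        wers = wer_by_weight[weight]
        return sum(wers) / len(wers)

    return min(wer_by_weight, key=mean_wer)
```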

6.3 Parameter setup

The parameter setups of the DM, MDI, and NMF methods use the same values as the best performing systems in the experiments of our previous studies [17, 18]. The settings are briefly summarized here. Mel-spectral features of T=20 subsequent frames were collected for each DM supervector. The PCA transformation in Eq. (2) was estimated from 1000 randomly selected clean-speech training set utterances and applied to reduce the supervector dimensionality to M=40 principal components. We have also conducted unpublished experiments utilizing both clean and reverberant data in the PCA training, which yielded slightly inferior ASR results compared to using only clean training data. The reasons behind this may be that it is difficult to learn a transform that simultaneously decorrelates both clean speech and speech reverberated with a range of reverberation times, and that it may be more important to decorrelate the target rather than the source domain prior to the mapping.
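The supervector construction and PCA projection can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the PCA is estimated with a plain SVD of the mean-centred data, and the number of Mel channels used in the usage example is arbitrary.

```python
import numpy as np

def stack_supervectors(mel, T=20):
    """Stack T consecutive frames into one supervector per column.

    mel: array of shape (n_mel, n_frames).
    Returns an array of shape (n_mel * T, n_frames - T + 1).
    """
    n_mel, n_frames = mel.shape
    cols = [mel[:, t:t + T].reshape(-1, order="F")
            for t in range(n_frames - T + 1)]
    return np.stack(cols, axis=1)

def fit_pca(X, M=40):
    """Estimate a PCA basis from X of shape (dim, n_samples), n_samples >= M.

    Returns the mean, shape (dim, 1), and the first M principal directions,
    shape (dim, M).
    """
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U[:, :M]

def project(X, mean, basis):
    """Project supervectors onto the PCA basis (decorrelated features)."""
    return basis.T @ (X - mean)
```

For example, a spectrogram with 23 Mel channels and 200 frames yields 181 supervectors of dimension 460, each of which is reduced to 40 decorrelated components.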

In the ASR experiments, the distribution mapping is applied in two iterations (see Section 3). The mapping function was updated every time the reverberation conditions changed, and the ICDF $\Phi _{y}^{-1}$ of the observations was collected from the full batch of utterances in each test condition. For the clean speech prior, we used a collection of random samples from the clean speech training set whose length was equal to that of the observation sample. Collectively, there are three tunable parameters in the DM initialization method (PCA dimension M, stack dimension T, and number of iterations).
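The core of distribution matching for a single feature dimension can be sketched as an empirical quantile mapping: each observed value is replaced by the clean-speech value that has the same cumulative probability. The sketch below is an assumption-laden simplification (empirical CDF from ranks, linear interpolation of the clean quantiles via `np.quantile`) and omits the PCA domain and the iteration described in the text; the function name is hypothetical.

```python
import numpy as np

def match_distribution(obs, clean):
    """Map 1-D observed samples so that their distribution follows `clean`.

    obs:   observed (reverberant) samples of one feature dimension.
    clean: samples of the same dimension drawn from clean speech.
    """
    obs = np.asarray(obs, dtype=float)
    clean = np.asarray(clean, dtype=float)
    ranks = np.argsort(np.argsort(obs))   # rank 0..n-1 of each observation
    u = (ranks + 0.5) / obs.size          # empirical CDF value in (0, 1)
    return np.quantile(clean, u)          # empirical inverse CDF of clean data
```

The mapping is monotonic, so the ordering of the observed values is preserved while their histogram is reshaped to that of the clean sample.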

Regarding the MDI system, the mask estimation stage requires three free parameters that were chosen to be the same (α=19, β=0.43, and γ=1.4) as in our earlier studies [17, 27]. In the imputation stage, we also utilized the same GMM as in our previous study: a 5-component GMM trained on a random 1000-utterance subset of the clean speech training set with a time context of three consecutive Mel-spectral feature frames. Together, the mask estimation and imputation stages have five tunable parameters.

For the NMF window length, we chose T=10 frames, which offered a good balance between dictionary complexity and ASR performance. The length of the NMF R matrix initialization filter that functions as an upper bound on the reverberation time the update algorithm can handle was set to $T_f=20$ samples to accommodate normal-sized rooms. The sparsity coefficient and iteration counts were set as follows: λ=1, $I_1=I_2=50$, and $I_3=100$. The clean speech dictionary consisting of K=7681 atoms was constructed by selecting one random T-frame segment from each clean speech training set utterance. The filter in step 3 of the update algorithm was optimized to give the NMF feature enhancement low average WER on all reverberation conditions and therefore it is not optimal for all the separate conditions. Based on multiple small-scale experiments, the filter was selected as $H_A(z)=1-0.9z^{-1}-0.8z^{-2}-0.7z^{-3}$. From dozens of candidates, the selected filter was the only one to work well on all reverberation conditions.
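To illustrate the effect of the activation filter, the following sketch applies the FIR filter with the taps given above along the time axis of an activation matrix. Clamping negative outputs to zero is an assumption made here to keep the activations non-negative; the function name and matrix layout (atoms by frames) are likewise illustrative.

```python
import numpy as np

# Taps of H_A(z) = 1 - 0.9 z^-1 - 0.8 z^-2 - 0.7 z^-3 as given in the text.
H_A = np.array([1.0, -0.9, -0.8, -0.7])

def filter_activations(A, taps=H_A):
    """Apply a causal FIR filter along the time axis of the activations.

    A: non-negative array of shape (n_atoms, n_frames).
    Negative outputs are clamped to zero (assumption, keeps A non-negative).
    """
    n_frames = A.shape[1]
    out = np.zeros_like(A, dtype=float)
    for k, h in enumerate(taps):
        if k >= n_frames:
            break
        out[:, k:] += h * A[:, :n_frames - k]  # delayed, weighted copy
    return np.maximum(out, 0.0)
```

A sustained activation such as `[1, 1, 1, 1, 1]` is reduced to `[1, 0.1, 0, 0, 0]`, so the onset is retained while the tail, which is where reverberant smearing would appear, is suppressed. This also shows why the filtering increases the sparsity of the activations.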

6.4 Delay-and-sum beamforming

For the delay-and-sum (DS) beamforming feature enhancement, we use the implementation of [36]. To describe DS beamforming in brief, it selects one of the channels as the reference signal and the differences between the arrival times of the reference and the other channel signals are estimated by generalized cross-correlation with a phase transformation [37]. By delaying the other channels by their estimated arrival times and summing all the signals, the coherent sound sources are amplified and the SNR of the speech signal is increased. In this work, DS is applied to the 8-channel data on the LDA+MLLT+SAT+f-bMMI back-end.
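The principle can be sketched in a few lines of NumPy. This is not the frequency-domain implementation of [36]; it is a simplified time-domain illustration that estimates integer sample delays with generalized cross-correlation with phase transform (GCC-PHAT) and aligns the channels by circular shifts. The function names, the `max_lag` search range, and the use of averaging instead of plain summation are assumptions.

```python
import numpy as np

def gcc_phat_delay(sig, ref, max_lag):
    """Estimate the delay of `sig` relative to `ref` in samples.

    A positive result means that `sig` lags behind `ref`.
    """
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that the array covers lags -max_lag .. +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

def delay_and_sum(channels, ref_index=0, max_lag=64):
    """Align all channels to the reference channel and average them.

    channels: array of shape (n_channels, n_samples).
    Alignment uses circular integer shifts; wrap-around at the signal
    edges is ignored in this sketch.
    """
    ref = channels[ref_index]
    out = np.zeros_like(ref, dtype=float)
    for ch in channels:
        delay = gcc_phat_delay(ch, ref, max_lag)
        out += np.roll(ch, -delay)  # advance the channel by its delay
    return out / len(channels)
```

After alignment, the speech components add coherently while uncorrelated noise adds incoherently, which is the source of the SNR gain mentioned above.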

6.5 Computational requirements

The overall real-time factor for both DM+NMF and MDI+NMF feature enhancements is approximately 6.9 on one thread of an Intel Xeon E3-1230V2 processor. There is no significant difference between the computational costs of the DM and MDI initialization methods, and the real-time factors for both methods are less than one. In fact, the NMF enhancement is the most computationally demanding processing stage of the whole ASR system. Since both initialization methods also utilize the same amount of training data, the benefit of the DM method over MDI is that there are only three free parameters to tune instead of five. During recognition, the DM method operates in full batch mode, whereas MDI works on an utterance-by-utterance basis.

7 Results

The ASR results for the REVERB Challenge development set are collected in Table 2 and for the evaluation set in Table 3. This section primarily reviews the evaluation set results of our systems. Comparable ASR results from external studies [6, 30] are also gathered in Table 3 and analyzed in Section 8. The feature enhancement combinations are grouped by their respective back-end systems. In Table 2, the results are shown as average WERs separately for the SimData and RealData recordings. In Table 3, the results are also shown for each recording condition, but the comparisons between the feature enhancement methods are based on their respective average WERs. For reference, the REVERB Challenge baseline results, with and without MC training and batch-based MLLR speaker adaptation, are shown in the first two rows of the result tables. The Challenge baselines make use of MFCC features concatenated with their first- and second-order derivatives and bigram LMs.

Table 2 Average SimData and RealData word error rates for the REVERB Challenge development set


Table 3 Average SimData and RealData word error rates for the REVERB Challenge evaluation set


For each back-end system, omitting the feature enhancement produces the highest error rates with the exception of DM enhancement on the LDA+MLLT+SAT+f-bMMI back-end, which gives the highest average error rate on RealData. For each back-end, the lowest error rates are obtained by taking advantage of either DM or MDI initialization in NMF feature enhancement, except for the LDA+MLLT back-end where NMF alone is the best performing feature enhancement. For each enhancement method, the corresponding average WERs decrease consistently on SimData as the complexity of back-end processing increases. On RealData, however, none of the feature enhancements on the LDA+MLLT+SAT+DNN back-end improves on its average result with the LDA+MLLT+SAT+f-bMMI back-end.

For both single-channel SimData and RealData, the proposed DM+NMF feature enhancement outperforms MDI+NMF for the majority of back-end systems. The WER improvements for the proposed DM+NMF method over the MDI+NMF are 0.45 % and 0.9 % on LDA+MLLT+SAT+DNN and LDA+MLLT+SAT+f-bMMI back-ends, respectively. On 8-channel recordings, DS+DM+NMF produces the lowest average WER on SimData, whereas DS+MDI+NMF gives the best performance on RealData.

8 Discussion

We have shown that the proposed DM+NMF feature enhancement achieves the highest average performances on both single-channel SimData and RealData recordings. However, these highest performance figures are achieved by a small margin relative to MDI+NMF and NMF and with different back-ends. DM+NMF is also conceptually simpler than our previous MDI+NMF approach, with fewer parameters to optimize. It also gives a performance advantage compared to the systems of Weninger et al. [6] and Tachioka et al. [30]. In the following subsections, we discuss the principles underlying our approach and how these give rise to the performance gains observed and then compare our results with those from other studies.

8.1 The principles of the approach

The main features of the enhancement method proposed in the current study are that it is unsupervised and makes only weak assumptions about the reverberation in both the DM and NMF stages. In contrast to DM, the MDI front-end requires a measurement of the extent of reverberation which is mapped to masked thresholds utilizing a function with three experimentally adjusted free parameters [27]. In the DM initialization, the two main assumptions are that reverberation effects are convolutive and long term, and that the same transformations can be used to decorrelate each reverberation condition. In the NMF stage, reverberation is again assumed to be convolutive with a long-term effect. The activation filtering assumes certain characteristics of temporal modulation patterns of activations that are common to all rooms. Therefore, neither the DM initialization nor NMF make assumptions relating to any specific room.

That said, the unsupervised nature of the proposed method also raises some challenges. The cost function we use measures the success of reconstructing the original observed speech, but its relation to the dereverberation or room characteristics is indirect (see Fig. 3). Therefore, it is possible for the cost function to converge even when the method does not apply dereverberation. This also explains why we needed to modify the iterative update rules to implement the NMFD model—our preliminary experiments conducted with and without initialization showed that the cost function converged, but the NMF dereverberation was not successful.

The filtering of the activation matrix by $H_A$, done in step 3 of the NMF update algorithm, is motivated by the need to remove traces of reverberation that remain in matrix A. These traces are caused by imperfections in the initial estimation stage and by the first stage of NMF reconstruction before the filter update is applied (step 2 and Fig. 3). More specifically, filtering the activation matrix by $H_A$ serves to move the traces of reverberation that remain in A to matrix R, which is updated in the next stage of iterations (step 4). The filtering scheme is similar to other approaches that apply modulation filters to counteract reverberation (e.g. [5]). It emphasizes reverberation-free speech onsets through a smoothed differentiator filter along the time trajectory; not in spectrograms as in earlier studies, but in the activations A. The filtering also increases the sparsity of A. After the matrix R update iterations (step 4), the following activation matrix A update (step 5) does not use activation filtering. The filtering scheme is motivated by the notion that it is more useful to model as much of the reverberation as possible in matrix R. The reason for this, as discussed above, is that the NMF cost function measures the precision of reconstruction of the original reverberant speech, rather than dereverberation that should be left for matrix R. Note that matrix R is updated only once, as our preliminary experiments revealed that by alternating the R and A updates, it is difficult to obtain stable estimates for both matrices. Our hypothesis is that either the cost function optimized by NMF is not optimal for reverberation or that the optimization algorithm gets easily stuck in local minima. Evidence supporting the former explanation is that increasing the iteration counts did reduce the cost function but impaired the recognition performance.

Considering the initialization step in the NMF algorithm on the most complex LDA+MLLT+SAT+f-bMMI and LDA+MLLT+SAT+DNN back-ends, the results indicate that it is beneficial to apply dereverberation during initialization. However, on the less complex LDA+MLLT+SAT back-end, the benefit is negligible and on the least complex LDA+MLLT back-end, the initialization step is detrimental as the NMF alone provides the lowest average WERs on both SimData and RealData.

Our previous studies [17, 18] have shown that DM outperforms MDI by a small margin in feature enhancement as it achieves 37.87 % and 72.25 % average WERs on the REVERB Challenge SimData and RealData recordings, respectively, while MDI yields 39.14 % and 71.67 %. This observation may also explain why DM is better than MDI when applied as the initialization method. However, we cannot conclude that any better dereverberation method used to initialize NMF would also lead to better factorization. For instance, in [17], experiments were conducted using NMF and MDI as separate feature enhancement methods for a system with acoustic models trained on unenhanced MFCCs. For non-reverberant speech signals, the MDI feature enhancement had no notable impact on performance compared to the clean-speech-trained baseline (the authors report WERs of 12.70 % and 12.55 %, respectively). However, the MDI-initialized NMF feature enhancement severely degraded the clean speech recognition accuracy (17.37 % WER), because the NMF introduced prominent artifacts in the speech signals.

8.2 Comparison to similar studies

As discussed in Section 8.1, one key property of our two-stage feature enhancement is its ability to generalize. Our approach is based on unsupervised learning, in which a filter with an arbitrary impulse response can be learned from data, and arbitrary speech utterances can be modeled through the combination of dictionary atoms using NMF. Accordingly, the dereverberation approach generalizes well to unseen data. In contrast, the RNN-based system in [6] requires supervised training and may become over-trained to particular reverberation conditions or speaker attributes. This may limit its ability to generalize to unseen data. Evidence that our system generalizes comparatively well to unseen room conditions can be found by comparing the SimData and RealData results for our system and the Weninger et al. system. The relative error reduction of our system (DM+NMF with the LDA+MLLT+SA+bMMI+MBR back-end, computed from the average results) over the Weninger et al. system is twice as large for RealData (6.0 %) as for SimData (3.0 %), indicating better performance for our system in mismatched conditions.
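For clarity, the relative error reduction is assumed here to follow the usual definition, in which the absolute WER difference is normalized by the WER of the reference system. The symbols $\mathrm{WER}_{\mathrm{ref}}$ (reference system) and $\mathrm{WER}_{\mathrm{prop}}$ (proposed system) are introduced only for this illustration.

```latex
\mathrm{RER} = 100\,\% \times
  \frac{\mathrm{WER}_{\mathrm{ref}} - \mathrm{WER}_{\mathrm{prop}}}
       {\mathrm{WER}_{\mathrm{ref}}}
```

Under this definition, a positive value indicates that the proposed system makes proportionally fewer errors than the reference system.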

A closer examination of the results obtained with LDA+MLLT+SAT+f-bMMI and LDA+MLLT+SAT+DNN back-ends reveals that although the MDI+NMF and DM+NMF feature enhancements benefit the DNN-based back-end system in terms of SimData performance, the improvements on RealData are not as large as with f-bMMI discriminative training. This may be due to non-optimal DNN training, as the risk of over-training is relatively prominent with DNNs.

The feature enhancement method of Tachioka et al. [30] is based on blind reverberation time estimation for a dereverberation process similar to spectral subtraction. Our method, on the other hand, does not make use of reverberation time but makes only weak assumptions about the reverberation conditions, as discussed in Section 8.1. With the LDA+MLLT+SAT+f-bMMI back-end, the DM-only feature enhancement achieves nearly as good a performance as Tachioka et al., with a relative average error increase of 4.1 % on SimData and 0.9 % on RealData. In our previous study [17], the MDI system based on the same mask estimation method as in the current study was shown to outperform an MDI method with mask estimation based on assessment of room reverberation. These findings imply that the final recognition performance can be significantly degraded by inaccuracies in reverberation estimates. In multichannel recordings, the Tachioka et al. system invokes DS beamforming with a cross-spectrum phase analysis and a peak-hold process for the direction of arrival estimation. While the beamforming in [30] is essentially an improved version of our DS implementation, our results indicate that a conventional DS performs better for the REVERB Challenge data. This is apparent from the observation that the relative difference between the average error rates of Tachioka et al. and DM+NMF is larger on 8-channel than on single-channel setups, for both SimData and RealData.

Even though our average DNN+SMBR discriminative training-based results (9.17 %) are slightly better than the comparable DNN+bMMI results of Tachioka et al. (9.77 %) on SimData, the Tachioka et al. system provides higher average performance on RealData (25.83 % WER vs. 26.56 % for our system). It is also noteworthy that in our experiments, discriminative training brought little benefit to the DNN system, whereas a more significant improvement was seen for Tachioka et al.’s DNN back-end. The best single-channel results in the study of Tachioka et al. are obtained by combining the results from 16 separate recognition systems by using recognizer output voting error reduction (ROVER). The average WERs for the ROVER system are 8.51 % for SimData and 23.70 % for RealData. To put things in perspective, the best performing single-channel recognizer in the REVERB Challenge, proposed by Delcroix et al. [38], achieved average WERs of 5.2 % on SimData and 17.4 % on RealData. The most significant benefit of the Delcroix et al. system compared to ours lies in the acoustic model, which has higher input dimensionality and was trained on an extended data set approximately five times the size of the REVERB Challenge training data set. The Delcroix et al. system also operated in full-batch mode.

9 Conclusions

This paper proposed a two-stage feature enhancement method for dereverberation of speech for noise-robust ASR, based on a combination of distribution matching and non-negative matrix factorization. The proposed method was evaluated with modern ASR back-ends based on variants of the GMM-HMM and DNN-HMM frameworks and shown to outperform our previous combination of missing data imputation and NMF [17] by a small margin. In several instances, the proposed method also gave higher recognition accuracy than the state-of-the-art reference approaches of [6, 30] with similar back-end processing. The main benefit of the proposed method over the reference approaches is that it generalizes well to unseen reverberation conditions. This was reflected in the most difficult real-data scenarios in the REVERB Challenge, where our DM+NMF-based ASR systems achieve the largest performance gains over reference approaches. Moreover, the NMF alone and MDI+NMF-based systems were also shown to perform well with respect to the reference approaches.

References

1. G Hinton, L Deng, D Yu, G Dahl, A Mohamed, N Jaitly, A Senior, V Vanhoucke, T Sainath, B Kingsbury, Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Proc. Mag. 29(6), 82–97 (2012).
2. JT Geiger, JF Gemmeke, B Schuller, G Rigoll, in Proc. INTERSPEECH. Investigating NMF speech enhancement for neural network based acoustic models (IEEE Singapore, Singapore, 2014).
3. S Thomas, S Ganapathy, H Hermansky, Recognition of reverberant speech using frequency domain linear prediction. IEEE Signal Proc. Let. 15, 681–684 (2008).
4. B Kingsbury, N Morgan, S Greenberg, Robust speech recognition using the modulation spectrogram. Speech Commun. 25, 117–132 (1998).
5. KJ Palomäki, GJ Brown, JP Barker, Techniques for handling convolutional distortion with ‘missing data’ automatic speech recognition. Speech Commun. 43(1–2), 123–142 (2004).
6. F Weninger, S Watanabe, J Le Roux, JR Hershey, Y Tachioka, J Geiger, B Schuller, G Rigoll, in Proc. REVERB Workshop (REVERB’14). The MERL/MELCO/TUM system for the REVERB Challenge using deep recurrent neural network feature enhancement (Florence, Italy, 2014).
7. JT Geiger, E Marchi, B Schuller, G Rigoll, in Proc. REVERB Workshop (REVERB’14). The TUM system for the REVERB Challenge: recognition of reverberated speech using multi-channel correlation shaping dereverberation and BLSTM recurrent neural networks (Florence, Italy, 2014).
8. A Sehr, R Maas, W Kellermann, Reverberation model-based decoding in the logmelspec domain for robust distant-talking speech recognition. IEEE Trans. Audio, Speech, Language Process. 18(7), 1676–1691 (2010).
9. DD Lee, HS Seung, in Adv. Neur. In. 13, ed. by TK Leen, TG Dietterich, and V Tresp. Algorithms for non-negative matrix factorization (MIT Press, Cambridge, 2001), pp. 556–562.
10. T Virtanen, Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE T. Audio Speech. 15(3), 1066–1074 (2007).
11. P Smaragdis, JC Brown, in IEEE Workshop Applicat. Signal Process. Audio and Acoust. Non-negative matrix factorization for polyphonic music transcription (IEEE, New Paltz, NY, USA, 2003), pp. 177–180.
12. KW Wilson, B Raj, P Smaragdis, A Divakaran, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). Speech denoising using nonnegative matrix factorization with priors (IEEE, Las Vegas, NV, USA, 2008), pp. 4029–4032.
13. JF Gemmeke, T Virtanen, A Hurmalainen, Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE T. Audio Speech. 19(7), 2067–2080 (2011).
14. P Smaragdis, in Independent Component Analysis and Blind Signal Separation. Lecture Notes in Computer Science, 3195, ed. by CG Puntonet, A Prieto. Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs (Springer, Berlin Heidelberg, 2004), pp. 494–499.
15. H Kameoka, T Nakatani, T Yoshioka, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). Robust speech dereverberation based on non-negativity and sparse nature of speech spectrograms (IEEE, Taipei, Taiwan, 2009), pp. 45–48.
16. K Kumar, R Singh, B Raj, R Stern, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). Gammatone sub-band magnitude-domain dereverberation for ASR (IEEE, Prague, Czech Republic, 2011), pp. 4604–4607.
17. H Kallasjoki, JF Gemmeke, KJ Palomäki, AV Beeston, GJ Brown, in Proc. REVERB Workshop (REVERB’14). Recognition of reverberant speech by missing data imputation and NMF feature enhancement (Florence, Italy, 2014).
18. K Palomäki, H Kallasjoki, in Proc. REVERB Workshop (REVERB’14). Reverberation robust speech recognition by matching distributions of spectrally and temporally decorrelated features (Florence, Italy, 2014).
19. U Remes, in Proc. INTERSPEECH. Bounded conditional mean imputation with an approximate posterior (ISCA, Lyon, France, 2013), pp. 3007–3011.
20. AV Beeston, GJ Brown, in UK Speech Conf. Modelling reverberation compensation effects in time-forward and time-reversed rooms (Cambridge, UK, 2013).
21. S Dharanipragada, M Padmanabhan, in Proc. Int. Conf. Spoken Lang. Process. (ICSLP). A non-linear unsupervised adaptation technique for speech recognition (ISCA, Beijing, 2000).
22. K Kinoshita, M Delcroix, T Yoshioka, T Nakatani, E Habets, R Haeb-Umbach, V Leutnant, A Sehr, W Kellermann, R Maas, S Gannot, B Raj, in Proc. IEEE Workshop Applicat. Signal Process. Audio and Acoust. (WASPAA). The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech (IEEE, New Paltz, NY, USA, 2013).
23. D Povey, A Ghoshal, G Boulianne, L Burget, O Glembek, N Goel, M Hannemann, P Motlicek, Y Qian, P Schwarz, J Silovsky, G Stemmer, K Vesely, in IEEE Automat. Speech Recognition and Understanding Workshop. The Kaldi speech recognition toolkit (IEEE, Waikoloa, HI, USA, 2011).
24. SM Pizer, EP Amburn, JD Austin, R Cromartie, A Geselowitz, T Greer, JB Zimmerman, K Zuiderveld, Adaptive histogram equalization and its variations. Comput. Vision Graph. 39(3), 355–368 (1987).
25. G Saon, S Dharanipragada, D Povey, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 1. Feature space Gaussianization (IEEE, Montreal, Canada, 2004), pp. 329–332.
26. CB Moler, Numerical Computing with MATLAB, Revised Reprint Paperback (Society of Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 2008).
27. KJ Palomäki, GJ Brown, JP Barker, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). Recognition of reverberant speech using full cepstral features and spectral missing data (IEEE, Toulouse, France, 2006).
28. T Robinson, J Fransen, D Pye, J Foote, S Renals, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition (IEEE, Detroit, MI, USA, 1995).
29. M Lincoln, I McCowan, J Vepa, HK Maganti, in IEEE Automat. Speech Recognition and Understanding Workshop. The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments (IEEE, Cancún, Mexico, 2005).
30. Y Tachioka, T Narita, F Weninger, S Watanabe, in Proc. REVERB Workshop (REVERB’14). Dual system combination approach for various reverberant environments with dereverberation techniques (Florence, Italy, 2014).
31. D Povey, K Yao, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). A basis method for robust estimation of constrained MLLR (IEEE, Prague, Czech Republic, 2011), pp. 4460–4463.
32. D Povey, D Kanevsky, B Kingsbury, B Ramabhadran, G Saon, K Visweswariah, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). Boosted MMI for model and feature-space discriminative training (IEEE, Las Vegas, NV, USA, 2008), pp. 4057–4060.
33. H Xu, D Povey, L Mangu, J Zhu, Minimum Bayes risk decoding and system combination based on a recursion for edit distance. Comput. Speech Lang. 25(4), 802–828 (2011).
34. X Zhang, J Trmal, D Povey, S Khudanpur, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). Improving deep neural network acoustic models using generalized maxout networks (IEEE, Florence, Italy, 2014).
35. B Kingsbury, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling (IEEE, Taipei, Taiwan, 2009), pp. 3761–3764.
36. MF Font, Multi-microphone signal processing for automatic speech recognition in meeting rooms. Master’s thesis, Universitat Politècnica de Catalunya, Spain, 2005.
37. CH Knapp, GC Carter, The generalized correlation method for estimation of time delay. IEEE T. Acoust. Speech. 24(4), 320–327 (1976).
38. M Delcroix, T Yoshioka, A Ogawa, Y Kubo, M Fujimoto, N Ito, K Kinoshita, M Espi, T Hori, T Nakatani, A Nakamura, in Proc. REVERB Workshop (REVERB’14). Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB Challenge (Florence, Italy, 2014).


Acknowledgements

The research was supported by the Academy of Finland projects 136209 (Sami Keronen, Kalle J. Palomäki) and 251170 (Heikki Kallasjoki and Kalle J. Palomäki). Guy J. Brown was supported by the EU project Two!Ears under grant agreement ICT-618075.

Author information

Authors and Affiliations

Department of Signal Processing and Acoustics, Aalto University, P.O. Box 13000, Aalto, 00076, Finland
Sami Keronen, Heikki Kallasjoki & Kalle J. Palomäki
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP, UK
Guy J. Brown
Audience, Inc., Mountain View, 94043, CA, USA
Jort F. Gemmeke


Corresponding author

Correspondence to Sami Keronen.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Keronen, S., Kallasjoki, H., Palomäki, K.J. et al. Feature enhancement of reverberant speech by distribution matching and non-negative matrix factorization. EURASIP J. Adv. Signal Process. 2015, 76 (2015). https://doi.org/10.1186/s13634-015-0259-1


Received: 13 February 2015
Accepted: 31 July 2015
Published: 20 August 2015
DOI: https://doi.org/10.1186/s13634-015-0259-1
