Environment-dependent denoising autoencoder for distant-talking speech recognition

In this paper, we propose an environment-dependent denoising autoencoder (DAE) and automatic environment identification based on a deep neural network (DNN) with blind reverberation estimation for robust distant-talking speech recognition. Recently, DAEs have been shown to be effective in many noise reduction and reverberation suppression applications because higher-level representations and increased flexibility of the feature mapping function can be learned. However, a DAE is not adequate in mismatched training and test environments. In a conventional DAE, parameters are trained using pairs of reverberant speech and clean speech under various acoustic conditions (that is, an environment-independent DAE). To address the above problem, we propose two environment-dependent DAEs to reduce the influence of mismatches between training and test environments. In the first approach, we train various DAEs using speech from different acoustic environments, and the DAE for the condition that best matches the test condition is automatically selected (that is, a two-step environment-dependent DAE). To improve environment identification performance, we propose a DNN that uses both reverberant speech and estimated reverberation. In the second approach, we add estimated reverberation features to the input of the DAE (that is, a one-step environment-dependent DAE or a reverberation-aware DAE). The proposed method is evaluated using speech in simulated and real reverberant environments. Experimental results show that the environment-dependent DAE outperforms the environment-independent one in both simulated and real reverberant environments. For two-step environment-dependent DAE, the performance of environment identification based on the proposed DNN approach is also better than that of the conventional DNN approach, in which only reverberant speech is used and reverberation is not blindly estimated. 
Moreover, the one-step environment-dependent DAE significantly outperforms the two-step environment-dependent DAE.


Introduction
In a distant-talking environment, channel distortion drastically degrades speech recognition performance because of mismatches between the training and test environments. There are two different approaches, namely, front-end- and back-end-based methods, for dealing with this problem [1]. Many front-end-based approaches [1][2][3][4][5][6][7][8] have been proposed to reduce the effect of reverberation in the observed speech signal. The back-end-based methods, on the other hand, attempt to modify the acoustic model and/or decoder to suit the respective reverberant environment [9,10]. In this paper, we focus on front-end-based approaches for distant-talking speech recognition.
*Correspondence: wang@vos.nagaokaut.ac.jp. Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka 940-2188, Japan. Full list of author information is available at the end of the article.
Many single- and multi-channel dereverberation methods have been proposed to suppress reverberation [2][3][4][11][12][13][14]. Single-channel dereverberation approaches are much easier and cheaper to implement in real applications than multi-channel ones. In this paper, dereverberation is performed using a single-channel speech signal. Cepstral mean normalization (CMN) can be considered the most general single-channel approach [15][16][17]. Having been extensively examined, it has been shown to be a simple and effective way of reducing reverberation by normalizing cepstral features. However, CMN-based dereverberation is not fully effective in environments with late reverberation. Several studies have focused on mitigating this problem [3,4,12]. A reverberation compensation method for speaker recognition using spectral subtraction [18], in which late reverberation is treated as additive noise, was proposed in [3]. A method based on multi-step linear prediction (MSLP) was proposed for both single and multiple microphones [4,12]. The method first estimates late reverberation using long-term MSLP and then suppresses it with subsequent spectral subtraction. Wölfel proposed joint compensation of noise and reverberation by integrating an estimate of the reverberation energy, derived from an auxiliary model based on MSLP, into a particle-filter framework that had previously been used to track and remove nonstationary additive distortion in a low-dimensional logarithmic power frequency domain [19].
Neural network (NN)-based approaches have been proposed for feature transformation [20,21]. Bottleneck features extracted by a multi-layer perceptron (MLP) can be used for nonlinear feature transformation [20]. However, MLPs with many hidden layers have a high computational cost, and the lower layers in a deep architecture are hard to train because of vanishing gradients. Deep belief networks, which employ an unsupervised pre-training method using restricted Boltzmann machines (RBMs), have been proposed to obtain better initial values for deep networks [22]. Deep neural networks (DNNs) with pre-training have been shown to achieve better performance than the conventional MLP without pre-training [22]. Many approaches based on DNNs, recurrent neural networks (RNNs), and long short-term memory (LSTM) networks [23][24][25][26][27][28][29] have been proposed for speech enhancement for human listening and for feature enhancement for robust speech recognition, and they have shown good performance on the REVERB challenge task [40]. Recently, the denoising autoencoder (DAE), one type of DNN, has been shown to be effective in many noise reduction applications because higher-level representations and increased flexibility of the feature mapping function can be learned [30][31][32][33]. Ishii et al. applied a DAE to spectral-domain dereverberation, resulting in improved word accuracy in large-vocabulary continuous speech recognition (LVCSR) [34]. Previously, we found that cepstral-domain DAE-based dereverberation is effective for distant-talking speech recognition [35]. As shown in [35], the DAE worked especially well under strong reverberation. However, under weak reverberation, its results were not as good as those of other methods. Typically, in the training of a DAE [26,34,35], data incorporating various environmental conditions are used.
Although this training method is suitable for training models that cover various environments, a DAE trained on various conditions that do not match those of the test data has a lower nonlinear transformation ability for certain acoustic conditions of the test set. Thus, the performance of a DAE cannot be sufficiently improved for an unknown reverberant test condition.
To improve the robustness of speech recognition, the use of side information about the environment as additional features, such as speaker-specific information (e.g., i-vectors) and room information, has been proposed previously [36][37][38]. In this paper, two environment-dependent DAEs are proposed to reduce the influence of mismatches between training and test environments; that is, DAEs are trained and used according to the different environments. In the first approach, various DAEs are trained using speech from different acoustic environments, and the DAE whose training condition best matches the test condition is automatically selected using a DNN (that is, a two-step environment-dependent DAE). The performance of the proposed two-step environment-dependent DAE depends on the precision of the automatic environment identification. To achieve higher environment identification performance, a DNN using both reverberant speech and reverberation estimated by MSLP is also proposed. In the second approach, reverberation features estimated by MSLP are directly used as an input of the DAE (that is, a one-step environment-dependent DAE or a reverberation-aware DAE). By simultaneously estimating and suppressing the environment-dependent reverberation with the one-step environment-dependent DAE, the mismatch between the training data and test data is reduced. Therefore, better estimation of the clean speech can be expected. In previous work, a conventional DAE was trained using speech data from various environments, and the test reverberant speech was transformed using this environment-independent DAE, which cannot deal with environmental variation when training data are limited. In the proposed approach, the test reverberant speech is transformed using an environment-dependent DAE that can estimate the environment-dependent reverberation.
Thus, the environment-dependent DAE is more robust to environmental changes than a conventional DAE. The proposed methods are evaluated in both simulated and real reverberant environments.
The remainder of this paper is organized as follows: Section 2 describes the DAE for cepstral-domain dereverberation. The methods for one-step and two-step environment-dependent DAEs are described in Section 3, while the experimental results and a discussion thereof are presented in Section 4. Finally, Section 5 summarizes the paper.

Topology of DAE
An autoencoder is a type of artificial neural network (NN) whose output is a reconstruction of its input, and it is often used for dimensionality reduction. DAEs share the same structure as autoencoders, but the input data are a noisy version of the teacher signal. In this paper, we use the clean reference speech signal as the teacher signal. DAEs use feature mapping to convert noisy input data into clean output and have been used for noise removal in the field of image processing [30]. Ishii et al. applied a DAE to spectral-domain dereverberation [34]. However, the enhanced spectral-domain feature needs to be converted to a cepstral-domain feature, and the resulting improvement is not sufficient. In this paper, we apply a DAE to cepstral-domain dereverberation because many LVCSR systems adopt a cepstral-domain feature as the direct input. Given pairs of clean speech and the corresponding reverberant speech, the DAE learns the non-linear function that converts reverberant speech features into clean speech features. In general, reverberation depends on both the current and several previous observation frames. Therefore, in addition to the vector of the current frame, vectors of past frames are concatenated to form the input.
For the cepstral feature $X_i$ of the observed reverberant speech at the $i$-th frame, the cepstral features of the $N-1$ frames before the current frame are concatenated with the current frame to form a cepstral vector of $N$ frames. The output $O_i$ of the non-linear transformer based on the DAE is given by

$$O_i = f_L(\cdots f_l(\cdots f_2(f_1(X_i, X_{i-1}, \ldots, X_{i-N+1})))) \quad (1)$$

where $f_l$ is the non-linear transformation function in layer $l$ ($l = 1, \ldots, L$) and $N$ is the number of frames used as the input features.
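The frame concatenation that forms the DAE input can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `splice_frames` is hypothetical, and repeating the first frame to pad the start of an utterance is an assumption, since the text does not say how the first $N-1$ frames are handled.

```python
def splice_frames(frames, n_context):
    """Concatenate each frame with its n_context - 1 predecessors.

    frames: list of per-frame feature vectors (lists of floats).
    Returns one vector per frame, ordered X_i, X_{i-1}, ..., X_{i-N+1}.
    """
    spliced = []
    for i in range(len(frames)):
        vector = []
        for k in range(n_context):
            # Assumption: clamp to frame 0 when the history runs out.
            vector.extend(frames[max(i - k, 0)])
        spliced.append(vector)
    return spliced


# Toy usage: 4 frames of 2-dimensional features, N = 3.
toy = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
out = splice_frames(toy, 3)
```

With 39-dimensional MFCC vectors and $N = 9$, each spliced vector has 39 × 9 = 351 dimensions, which matches the input size reported in the experimental conditions.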
The topology of the cepstral-domain DAE for dereverberation is shown in Fig. 1. In this paper, the number of hidden layers is set to three. In Fig. 1, $W_i$ ($i = 1, 2$) denotes the weight matrix of each layer, and $W_i^T$ denotes the transpose of $W_i$ (see Endnote 1). That is to say, $W_1$ and $W_2$ are the encoder matrices, and $W_1^T$ and $W_2^T$ are the decoder matrices.
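A forward pass through this tied-weight topology might look like the sketch below. The layer sizes follow the experimental conditions (351 inputs, 512 units per hidden layer); the bias vectors, the sigmoid output layer, the random initial values, and all names are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dae_forward(x, W1, W2, biases):
    """Map spliced reverberant features to an estimate of clean features.

    Encoder weights are W1 and W2; decoder weights are their transposes.
    biases: four bias vectors, one per weight layer (an assumption).
    """
    h1 = sigmoid(x @ W1 + biases[0])       # hidden layer 1
    h2 = sigmoid(h1 @ W2 + biases[1])      # hidden layer 2 (middle layer)
    h3 = sigmoid(h2 @ W2.T + biases[2])    # hidden layer 3
    return sigmoid(h3 @ W1.T + biases[3])  # output, each value in (0, 1)


rng = np.random.default_rng(0)
n_in, n_hid = 351, 512
W1 = 0.01 * rng.standard_normal((n_in, n_hid))
W2 = 0.01 * rng.standard_normal((n_hid, n_hid))
biases = [np.zeros(n_hid), np.zeros(n_hid), np.zeros(n_hid), np.zeros(n_in)]
batch = rng.random((4, n_in))  # four dummy input vectors scaled to [0, 1)
output = dae_forward(batch, W1, W2, biases)
```

Because the decoder reuses the encoder matrices, only two weight matrices need to be stored for the four weight layers.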

Restricted Boltzmann machine
To train a deep neural network, deep belief networks (DBNs) [22] are used for pre-training because they can obtain accurate initial values of the deep-layer neural networks.
An RBM is a bipartite graph, as shown in Fig. 2. It has a visible layer and a hidden layer, in which visible units that represent observations are connected to hidden units that learn to represent features through weighted connections. An RBM is restricted such that there are no visible-visible or hidden-hidden connections. Different types of RBMs are used for binary and real-valued input. Bernoulli-Bernoulli RBMs are used to convert binary stochastic variables to binary stochastic variables. Gaussian-Bernoulli RBMs are used to convert real-valued stochastic variables to binary stochastic variables. Details of RBMs can be found in [22].
To obtain pre-trained RBMs, we trained all the hidden layers using Bernoulli-Bernoulli RBMs. DBNs are hierarchically configured by stacking these pre-trained RBMs. Here, $W_1$ and $W_2$ are learned automatically, and $W_1^T$ and $W_2^T$ are generated from $W_1$ and $W_2$, as shown in Fig. 1.
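One common way to train a Bernoulli-Bernoulli RBM is one-step contrastive divergence (CD-1). The text does not state which procedure was used here, so the sketch below is an assumption; only the learning rate of 0.002 is taken from the experimental conditions, and the function and variable names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(v0, W, b_vis, b_hid, lr, rng):
    """Apply one CD-1 update in place; return the mean squared reconstruction error."""
    h0_prob = sigmoid(v0 @ W + b_hid)                  # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1_prob = sigmoid(h0 @ W.T + b_vis)                # reconstruction
    h1_prob = sigmoid(v1_prob @ W + b_hid)             # negative phase
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return float(np.mean((v0 - v1_prob) ** 2))


rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
data = (rng.random((8, n_vis)) < 0.5).astype(float)  # toy binary batch
for _ in range(10):
    err = cd1_update(data, W, b_vis, b_hid, lr=0.002, rng=rng)
```

After one RBM is trained, its hidden activations would serve as the visible data for the next RBM in the stack.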

Backpropagation algorithm
After pre-training, the backpropagation algorithm was applied to fine-tune the parameters. Given a pair of signals (the input signal and the ideal teacher signal, i.e., the cepstral feature of clean speech), backpropagation modifies the weights of the network to reduce the error between the teacher signal and the output value. We scaled the cepstral feature values of the input data and teacher signal to between 0 and 1 using a sigmoid function. The network is trained by minimizing the cross entropy using conjugate gradients [22].
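The scaling step and the training criterion can be illustrated with a short sketch. The function names are hypothetical, the inverse mapping `unsquash` is an assumption about how outputs could be returned to the cepstral domain, and the conjugate-gradient optimizer itself is not shown.

```python
import math


def squash(value):
    """Scale a cepstral coefficient into (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-value))


def unsquash(scaled):
    """Inverse of squash (the logit); assumed, not described in the text."""
    return math.log(scaled / (1.0 - scaled))


def cross_entropy(target, output):
    """Cross entropy between scaled teacher and output vectors."""
    return -sum(
        t * math.log(o) + (1.0 - t) * math.log(1.0 - o)
        for t, o in zip(target, output)
    )


teacher = [squash(c) for c in (-1.5, 0.0, 2.0)]
good = [squash(c) for c in (-1.4, 0.1, 1.9)]  # close to the teacher
bad = [squash(c) for c in (1.5, -2.0, -1.0)]  # far from the teacher
```

For a fixed teacher vector, the cross entropy is smallest when the output equals the teacher, so `cross_entropy(teacher, good)` is lower than `cross_entropy(teacher, bad)`.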

Environment-dependent denoising autoencoder
A conventional DAE trained using data from various acoustic conditions is effective for noise reduction and dereverberation. However, with limited training data, it cannot cope with mismatched training and test conditions or with unseen data. We address this problem with two approaches. In the first approach, a separate DAE is trained for each environment, and the DAEs are used selectively. However, the reverberation environment is unknown at the test stage. We therefore use a DNN for environment identification because it is more effective than other classifiers, such as the Gaussian mixture model (GMM) and support vector machine, for audio classification [39]. In the second approach, reverberation features estimated by MSLP are directly used as an input of the DAE (that is, a one-step environment-dependent DAE or a reverberation-aware DAE). In the following, we describe the proposed environment-dependent DAEs.

Environment-independent and environment-dependent DAEs
For a conventional DAE (environment-independent DAE), parameters are trained using pairs of reverberant speech and clean speech under various acoustic conditions. The environment-independent DAE is not robust to mismatches between training and test conditions. To address this problem, we propose two environment-dependent DAEs to mitigate the influence of mismatch between the training and test environments. In the first approach (that is, the two-step environment-dependent DAE), various DAEs are trained using speech from different acoustic environments, and the DAE whose training condition best matches the test condition is automatically selected, as shown in Fig. 3.
In the second approach (that is, the one-step environment-dependent DAE), we add environmental information (e.g., estimated reverberation features) to the input of the DAE, as shown in Fig. 4. These approaches are expected to reduce the influence of mismatch between the training and test environments and to improve LVCSR performance.
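The selection logic of the two-step approach can be sketched as below. This is an illustration under stated assumptions: the identification DNN is assumed to output per-frame class posteriors, averaging them over the utterance before taking the maximum is an assumption, and the "DAEs" are stand-in functions with hypothetical names.

```python
def select_environment(frame_posteriors):
    """Return the index of the most likely environment for an utterance.

    frame_posteriors: per-frame lists of class probabilities from the
    identification DNN. Averaging over frames is an assumption.
    """
    n_frames = len(frame_posteriors)
    n_classes = len(frame_posteriors[0])
    mean = [
        sum(frame[c] for frame in frame_posteriors) / n_frames
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: mean[c])


def two_step_dereverb(features, frame_posteriors, daes):
    """Pick the DAE for the identified environment and apply it."""
    env = select_environment(frame_posteriors)
    return env, daes[env](features)


# Toy usage with stand-in "DAEs" (plain functions) for three environments.
daes = [lambda f, k=k: [x - k for x in f] for k in range(3)]
posteriors = [[0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.2, 0.6, 0.2]]
env, enhanced = two_step_dereverb([1.0, 2.0], posteriors, daes)
```

In this toy example the second environment (index 1) has the highest average posterior, so its stand-in DAE is applied. A wrong identification would route the utterance to a mismatched DAE.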

Environment identification
First, we divide the training data according to the environment. This is performed manually because the environment of the training data is known. Next, each DAE is trained with the data of its respective environment, and the environment identification model is also trained. The training approach for the environment identification model is the same as that for the DAE, except for the input data and the use of the reverberation environment as the correct label. In the DAE, the second half of the weights is generated from the first half (that is, $W_1^T$ and $W_2^T$ are generated from $W_1$ and $W_2$). Although reverberant speech features are used as input in the training of a DAE, this is not sufficient for training the identification model. As shown in Fig. 4, reverberation is estimated from reverberant speech, and the reverberation features are also used as input. By doing this, we can expect the performance of environment identification to be improved. In this paper, estimation of reverberation is based on MSLP [12]. The original MSLP algorithm estimates the reverberation from reverberant speech and suppresses it in the spectral domain. In this study, MSLP is used for reverberation estimation only, and reverberation suppression is not performed. In MSLP, the observed signal is modeled as

$$y(n) = \sum_{p=0}^{N-1} w(p)\, y(n - p - D) + e(n) \quad (2)$$

where $y(n)$ is the observed signal, $D$ is the step size, $N$ is the filter length, $w(p)$ are the prediction coefficients, and $e(n)$ is the prediction error. The prediction filter $w(p)$ is estimated by minimizing the mean square energy of the prediction error. Late reverberation is estimated using the reverberant speech and the estimated prediction coefficients.
Figure 5 shows the flowchart for environment-dependent dereverberation using the environment identification technique and multiple DAEs. First, we identify the reverberation environment of the input speech by applying the identification model to the speech. Here, the reverberation features estimated by MSLP are used as input for the model, as they are in its training. Next, the DAE corresponding to the identified environment is selected, and dereverberation is applied by it. Since dereverberation is applied by a DAE suited to the environment of the speech, we expect an improvement in performance.
Fig. 5 Flowchart of environment-dependent DAE and automatic environment identification
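The delayed linear prediction described above can be sketched in the time domain as follows. This is a simplified illustration, not the algorithm of [12]: any pre-processing and the later conversion of the estimate into reverberation features are omitted, the least-squares solver is one way to minimize the mean square prediction error, and the function name is hypothetical. The toy values of the step size and filter length are far smaller than the 500 and 750 used in the experiments.

```python
import numpy as np


def estimate_late_reverb(y, delay, order):
    """Fit w(p) by least squares and return (w, late-reverberation estimate).

    Each sample y[n] is predicted from y[n - delay - p], p = 0 .. order - 1.
    """
    y = np.asarray(y, dtype=float)
    n_idx = np.arange(delay + order - 1, len(y))
    # Column p of A holds the samples y[n - delay - p] for every n in n_idx.
    A = np.stack([y[n_idx - delay - p] for p in range(order)], axis=1)
    w, *_ = np.linalg.lstsq(A, y[n_idx], rcond=None)
    late = np.zeros_like(y)
    late[n_idx] = A @ w  # predicted (late) part; the residual is e(n)
    return w, late


# Toy signal: every sample is 0.9 times the sample `delay` steps earlier.
rng = np.random.default_rng(0)
delay, order = 3, 2
y = np.zeros(60)
y[:delay] = rng.standard_normal(delay)
for n in range(delay, len(y)):
    y[n] = 0.9 * y[n - delay]
w, late = estimate_late_reverb(y, delay, order)
```

For this toy signal the fitted coefficients are close to `[0.9, 0.0]`, because the signal is exactly predictable from the sample `delay` steps back.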

One-step environment-dependent DAE
The second approach is almost the same as that used for environment identification in Section 3.2.1.
In this approach, the estimated reverberation and the reverberant speech are directly used as inputs of the reverberation-aware DAE, as shown in Fig. 4. The one-step environment-dependent DAE can estimate and suppress the environment-dependent reverberation automatically. It is therefore expected to reduce the influence of mismatches between training and test environments.
On the other hand, the conventional DAE does not use environment-dependent reverberation, so its performance is not robust to mismatches between training and test conditions.
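The input of the reverberation-aware DAE can be pictured as a simple concatenation of the two feature streams. The text does not specify the layout or the dimensionality of the reverberation features, so the sizes below (both streams as 39-dimensional vectors spliced over 9 frames) and the function name are assumptions.

```python
def reverberation_aware_input(speech_vec, reverb_vec):
    """Join reverberant-speech and estimated-reverberation features."""
    return list(speech_vec) + list(reverb_vec)


# Assumed sizes: both streams are 39-dim features spliced over 9 frames.
speech_vec = [0.0] * (39 * 9)
reverb_vec = [0.0] * (39 * 9)
dae_input = reverberation_aware_input(speech_vec, reverb_vec)
```

Under these assumptions the DAE input would have 702 dimensions, while the teacher signal would remain the clean-speech feature vector.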

Experimental setup

Training dataset
We used the training dataset provided by the "REVERB challenge" (reverberant voice enhancement and recognition benchmark) [40]. This dataset consists of the clean WSJCAM0 [41] training set and a multi-condition (MC) training set. Reverberant speech is generated from the clean WSJCAM0 training data by convolving the clean utterances with measured room impulse responses and adding recorded background noise. The reverberation times of the measured impulse responses range from approximately 0.1 to 0.8 s. The training data of the "REVERB challenge" were used to train the environment identification DNN. The environment labels depended on the room type and the distance between the microphone and speaker. The training data include three types of rooms and two distances between the microphone and speaker, so six environments with distinct rooms and distances were used in total. This training dataset was also used to train the DAEs and acoustic models. It should be noted that the recording rooms used for the multi-condition training data and the test data were different.

Development and evaluation test sets
It is important to note that the REVERB challenge dataset consists of real recordings (RealData) and simulated data (SimData), part of which has similar characteristics to the RealData in terms of reverberation time and microphone-speaker distance. This setup allowed us to perform evaluations in terms of both practicality and robustness to various reverberant conditions. Specifically, the development (Dev.) and final evaluation (Eval.) test sets each contained both SimData and RealData: SimData was generated from the WSJCAM0 corpus [41], and RealData was taken from the MC-WSJ-AV corpus [42]. The development dataset was used to determine the optimal parameters for dereverberation and speech recognition. Details of the training and test datasets are given in Tables 1 and 2.

Experimental conditions for LVCSR and dereverberation
In this study, Mel-frequency cepstral coefficients (MFCCs) were used as features for LVCSR. The dimension of the MFCC feature vector was 39, comprising 12 MFCCs plus power and their delta and delta-delta coefficients. MFCC features were normalized using the mean of the entire multi-condition training set. DAE training was carried out using mini-batch conjugate gradients with a mini-batch size of 128 samples. The number of hidden layers was set to three. The number of units in each layer was 512, and each unit used a sigmoid activation function. Because reverberation affects multiple frames, we supplied multiple frames at the same time as the input and teacher signals of the DAE. The dimension of both the input and the output was 39 (the dimension of the MFCC vector per frame) × 9 (the number of frames supplied at the same time) = 351. These parameters were determined empirically. Please refer to [35] for a more detailed description. Fifty epochs with a learning rate of 0.002 were used for all layers during pre-training, and 100 epochs with a learning rate of 0.1 were used for all layers during fine-tuning. The training procedure for the environment identification model was almost the same as that for the DAEs. For this model, the number of hidden layers was 5, the number of units in each layer was 1024, and 20 epochs were used for all layers during fine-tuning. The number of classes (the number of units in the final softmax layer) was 6, which is determined by the number of environments in the training data (i.e., Room 1, Room 2, and Room 3, each with near and far conditions). The MSLP algorithm generates an inverse filter through the prediction coefficients to estimate the inverse system [12]. We estimated the late reverberation components using the inverse filter and applied dereverberation by power spectral subtraction. For MSLP-based dereverberation, the step size and the order of linear prediction were set to 500 and 750, respectively.
We used MSLP to estimate the late reverberation of both the training and test data with the same parameters as for MSLP-based dereverberation.
We used a subspace GMM with maximum mutual information-based discriminative training (MMI-SGMM) [43] and a DNN trained with the cross-entropy criterion as the acoustic models. The KALDI toolkit [44] was used as the decoder for LVCSR. In this study, the numbers of hidden layers and units were set to 3 and 1024, respectively, for the DNN acoustic model. The final results were obtained from a confusion network combination of MMI-SGMM with 7000 states and DNN-HMM with 2500 states. Details can be found in [44]. A standard Wall Street Journal 5000-word trigram language model was used for decoding. We used the word error rate (WER) to evaluate the speech recognition performance of each method.
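For reference, WER is conventionally computed as the word-level edit distance between the reference and the recognized word sequence, divided by the number of reference words. The sketch below illustrates that standard definition with a hypothetical function name; it is not the scoring tool used in the experiments, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


example = word_error_rate("a b c d", "a x c")  # one substitution, one deletion
```

Here two errors against four reference words give a WER of 50 %.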

Results of environment-dependent DAE
In this section, we compare four dereverberation methods. Tables 3 and 4 show the speech recognition results for each method on the Dev. and Eval. datasets, respectively. For the Dev. dataset, DAE-based cepstral-domain dereverberation shows a remarkable improvement when compared with CMVN- and MSLP-based dereverberation. The DAE works especially well under strong reverberation, i.e., with the far-field microphone in "Room 2" and "Room 3" of SimData. The cepstral-domain environment-independent DAE outperformed CMN and MSLP under almost all conditions. Although the two-step environment-dependent DAE performs better than the conventional environment-independent DAE in some environments, it performs worse in others. The reason may be that the performance of environment identification and of the environment-specific DAEs depends on the training data, the environment labels, and the environment of the test data. The proposed one-step environment-dependent DAE (that is, reverberation-aware DAE) outperformed the conventional environment-independent DAE and the two-step environment-dependent DAE on both SimData and RealData in the Dev. and Eval. datasets using SGMM, DNN, and SGMM+DNN. The improvement of the one-step environment-dependent DAE under strong reverberation is greater than that under weak reverberation. The reason is that the conventional DAE is not effective enough when there are large environmental mismatches between the training and test conditions, whereas the one-step environment-dependent DAE can reduce the influence of the mismatch by estimating the late reverberation and adding it to the input of the DAE. For SimData in the Dev. dataset, using the one-step environment-dependent DAE with reverberation features estimated by MSLP, the average WER was reduced from 6.36 % with the conventional DAE to 5.77 % using SGMM+DNN, i.e., a relative error reduction rate of 9.28 %. For RealData in the Dev. dataset, the average WER was reduced from 27.46 % with the conventional DAE to 26.66 % using SGMM+DNN, i.e., a relative error reduction rate of 2.91 %. For the Eval. dataset, a similar trend was observed. The proposed one-step environment-dependent DAE (that is, reverberation-aware DAE) with reverberation features estimated by MSLP outperformed all the other methods. For the SGMM+DNN acoustic model, compared with the conventional DAE, relative error reduction rates of 9.97 % for SimData and 12.21 % for RealData were achieved. The results show that the proposed one-step environment-dependent DAE is also robust to variations of speaker and speech context.
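The relative error reduction rates quoted above follow from the absolute WERs as (baseline - proposed) / baseline. The short check below recomputes them; the Dev. values are those given in this section, the Eval. values are those quoted in the Conclusions, and the function name is hypothetical.

```python
def relative_reduction(baseline_wer, proposed_wer):
    """Relative error reduction in percent."""
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# (conventional DAE WER, one-step DAE WER) with SGMM+DNN, as quoted in the text.
reported = {
    "Dev. SimData": (6.36, 5.77),
    "Dev. RealData": (27.46, 26.66),
    "Eval. SimData": (7.12, 6.41),
    "Eval. RealData": (30.56, 26.83),
}
reductions = {
    name: round(relative_reduction(base, prop), 2)
    for name, (base, prop) in reported.items()
}
```

The recomputed values are 9.28, 2.91, 9.97, and 12.21 %, in agreement with the reported rates.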

Comparison of different environment identification models
In this section, we investigate the effect of the environment identification method on the two-step environment-dependent DAE. We compare the performance of the environment identification model trained with and without reverberation estimated by MSLP. Table 5 shows the speech recognition results on the Dev. dataset for these two methods. The results are based on a system combination of MMI-SGMM and DNN. A bigram language model was used. These results indicate that the performance of environment identification is improved by using estimated reverberation in training the DNN. By using reverberation blindly estimated by MSLP, the DNN can identify an unknown test environment more precisely. Without using the estimated reverberation, the two-step environment-dependent DAE performs worse than the conventional environment-independent one owing to poor environment identification performance. That is to say, the two-step environment-dependent DAE is sensitive to the environment identification performance.

Comparison with results of the other participants in the REVERB-challenge
We compared our results with those of the other participants under the same conditions for the training data and language model. A single-channel dataset provided by the REVERB-challenge was used. Table 6 shows the speech recognition results using the trigram language model for each participant. The WER of Alam et al. [45] was 11.1 % on SimData and 32.4 % on RealData. Tachioka et al. [46] achieved a WER of 10.05 % on SimData and 28.06 % on RealData. In our study, for the Eval. dataset, WER was 7.04 % on SimData and 28.66 % on RealData using MMI-SGMM, and 6.41 % on SimData and 26.83 % on RealData using SGMM+DNN.
The results indicate that the performance of our proposed environment-dependent DAE is better than almost all the other participants' methods using the same training data and language model.

Conclusions
In this paper, we proposed two environment-dependent DAEs for robust distant-talking speech recognition. The proposed methods were evaluated using simulated and real distant-talking speech. DAE-based cepstral-domain dereverberation achieved a remarkable improvement compared with CMN- and MSLP-based dereverberation in both environments. Furthermore, speech recognition performance was improved by the environment-dependent DAE compared with the conventional environment-independent DAE. For SimData in the Eval. dataset, using the one-step environment-dependent DAE with reverberation features estimated by MSLP, the average WER was reduced from 7.12 % with the conventional DAE to 6.41 % using SGMM+DNN, i.e., a relative error reduction rate of 9.97 %. For RealData in the Eval. dataset, the average WER was reduced from 30.56 % with the conventional DAE to 26.83 % using SGMM+DNN, i.e., a relative error reduction rate of 12.21 %. The results of our proposed dereverberation method are better than almost all of those of the other participants in the REVERB-challenge for single-channel speech and trigram language model conditions.

Endnote
1 $W_i$ and $W_i^T$ correspond to $f_L$ in Eq. 1.

Competing interests
The authors declare that they have no competing interests.