Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

EURASIP Journal on Advances in Signal Processing

Table 2 Performance of speech enhancement methods using eight-channel inputs on simulated and real room recordings. The results are averaged over the near and far test cases.

“mean” is the mean of the per-utterance scores, while “med.” is their median. “random” means that the frames of each DNN training minibatch are randomly selected from the whole training corpus, while “seq.” refers to the case in which each utterance is used as one minibatch. An output size of “3x257” means that the static, delta, and acceleration spectra are all predicted. The target size of “257-3x257” refers to the sequential training in Section 4.4.2. All DNNs use 15 frames of input and 3072 nodes per hidden layer. DNN8c is the same as DNN8 except that the clean phase is used. The best result for each metric (not including DNN8c) is shown in italics.
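The difference between the “random” and “seq.” minibatch schemes can be illustrated with a minimal sketch. It is not the authors' training code: the function names, the toy corpus, and the batch size of 32 are hypothetical and chosen only for illustration, and the only detail taken from the table note is the 257-bin frame size.

```python
import numpy as np

def random_minibatches(utterances, batch_size, rng):
    """'random': pool the frames of all utterances, shuffle them, and cut
    the shuffled pool into fixed-size minibatches."""
    frames = np.concatenate(utterances, axis=0)
    order = rng.permutation(len(frames))
    return [frames[order[i:i + batch_size]]
            for i in range(0, len(frames), batch_size)]

def sequential_minibatches(utterances):
    """'seq.': each utterance, with its frames kept in temporal order,
    forms one minibatch (so minibatch sizes vary with utterance length)."""
    return list(utterances)

rng = np.random.default_rng(0)
# Hypothetical toy corpus: three utterances of 40, 55, and 33 frames,
# each frame holding 257 spectral bins.
utterances = [rng.standard_normal((n, 257)) for n in (40, 55, 33)]

random_batches = random_minibatches(utterances, batch_size=32, rng=rng)
seq_batches = sequential_minibatches(utterances)
```

In the sequential scheme the frames within a minibatch stay contiguous in time, whereas the random scheme mixes frames from different utterances and positions.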