Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

EURASIP Journal on Advances in Signal Processing

Table 1 Performance of speech enhancement methods using single-channel input on simulated and real room recordings. The results are averaged over near and far test cases.

Name	Architecture	CMN	CD mean (sim.)	CD med. (sim.)	SRMR mean (sim.)	LLR mean (sim.)	LLR med. (sim.)	SNR mean (sim.)	SNR med. (sim.)	SRMR mean (real)
None	-	No	3.97	3.69	3.68	0.58	0.51	3.62	5.39	3.18
SS	-	No	3.82	3.51	4.0	0.56	0.51	4.75	6.89	3.94
Effect of input context size and hidden layer size
DNN1	11x257-2048-2048-2048-257	Yes	2.64	2.41	5.78	0.52	0.48	7.19	8.09	4.54
DNN2	15x257-3072-3072-3072-257	Yes	2.53	2.31	5.80	0.51	0.47	7.54	8.31	4.36
DNN3	19x257-3072-3072-3072-257	Yes	2.50	2.28	5.77	0.50	0.47	7.55	8.35	4.36

“mean” is the mean of the per-utterance scores, while “med.” is their median. “CMN” indicates whether utterance-wise mean normalization is applied during both DNN training and testing. A DNN architecture of “11x257-2048-2048-2048-257” means that 11 frames of the 257D log-magnitude spectrum are used as input, followed by three hidden layers of 2048 nodes each, and that the output layer predicts the 257D log-magnitude spectrum of clean speech.
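
The architecture notation and the preprocessing described in the note above can be made concrete with a short sketch. It is illustrative only and is not the authors' implementation. The function names (`parse_architecture`, `count_parameters`, `utterance_cmn`, `splice`) are hypothetical, and three details are assumptions: the parameter count assumes plain fully connected layers with biases, the context window is assumed to be centred on the current frame with the first and last frames repeated at the utterance edges, and the mean normalization is assumed to be applied per feature dimension over all frames of one utterance.

```python
from statistics import mean


def parse_architecture(spec):
    """Split a string such as '11x257-2048-2048-2048-257' into its parts."""
    input_part, *rest = spec.split("-")
    context, feat_dim = (int(v) for v in input_part.split("x"))
    *hidden, output_dim = (int(v) for v in rest)
    return {
        "context": context,                # number of stacked input frames
        "feat_dim": feat_dim,              # dimension of one spectral frame
        "input_dim": context * feat_dim,   # size of the spliced input vector
        "hidden": hidden,                  # hidden layer sizes
        "output_dim": output_dim,          # size of the predicted clean frame
    }


def count_parameters(arch):
    """Weights plus biases, assuming plain fully connected layers."""
    sizes = [arch["input_dim"], *arch["hidden"], arch["output_dim"]]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def utterance_cmn(frames):
    """Subtract the per-dimension mean computed over one utterance (assumed form of CMN)."""
    means = [mean(column) for column in zip(*frames)]
    return [[x - m for x, m in zip(frame, means)] for frame in frames]


def splice(frames, context):
    """Stack `context` frames centred on each frame.

    Edge handling (repeating the first/last frame) is an assumption.
    """
    half = context // 2
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-half, half + 1):
            # Clamp the index so that edge frames are repeated.
            window.extend(frames[min(max(t + offset, 0), last)])
        spliced.append(window)
    return spliced


if __name__ == "__main__":
    for name, spec in [
        ("DNN1", "11x257-2048-2048-2048-257"),
        ("DNN2", "15x257-3072-3072-3072-257"),
        ("DNN3", "19x257-3072-3072-3072-257"),
    ]:
        arch = parse_architecture(spec)
        print(name, arch["input_dim"], count_parameters(arch))
```

Under these assumptions, the input vector of DNN1 has 11 x 257 = 2827 elements. Widening the context (DNN2, DNN3) enlarges only the first weight matrix, whereas widening the hidden layers enlarges every matrix.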