Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

EURASIP Journal on Advances in Signal Processing

Table 5 ASR performance (WER, %) on the evaluation data using clean-condition training data

CT	MA	Simulated rooms						Real		Avg.
		Room1A		Room2A		Room3A		Room1
		Near	Far	Near	Far	Near	Far	Near	Far
Single microphone
N	N	19.0	25.6	34.5	69.8	47.1	78.3	80.2	76.6	53.9
Y	N	15.6	20.7	24.2	45.3	30.9	57.5	63.1	62.4	40.0
N	Y	14.1	17.9	21.3	45.1	28.3	59.5	66.4	65.9	39.8
Y	Y	14.5	18.2	21.2	38.8	26.8	50.3	57.3	58.0	35.6
Two microphones, with MVDR
N	N	18.0	23.3	27.7	59.8	40.1	71.2	75.1	73.7	48.6
Y	N	14.5	19.0	20.6	38.8	26.6	51.0	56.5	58.6	35.7
N	Y	13.5	17.0	18.9	36.8	24.5	51.4	58.8	59.3	35.0
Y	Y	13.7	17.4	18.3	33.4	23.3	45.2	51.2	53.1	31.9
Eight microphones, with MVDR
N	N	17.0	21.3	23.6	40.3	30.5	53.2	59.3	58.1	37.9
Y	N	14.3	17.2	18.0	27.9	21.7	36.2	43.1	46.4	28.1
N	Y	13.6	16.4	17.3	26.6	20.1	35.6	44.4	46.1	27.5
Y	Y	13.7	16.2	15.8	24.1	19.5	32.3	38.1	42.6	25.3

CT stands for cross transform, and MA refers to the 256-class CMLLR model adaptation. Y and N indicate whether each technique is applied.
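
The Avg. column and the gain from combining CT and MA can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the original paper. It assumes that Avg. is the unweighted mean of the eight near/far conditions, which agrees with the tabulated values to within rounding, and it uses relative WER reduction as one possible summary measure. The names `ROWS`, `mean_wer`, and `relative_reduction` are hypothetical helpers introduced here.

```python
# WER (%) values copied from Table 5. Column order within each list:
# Room1A near/far, Room2A near/far, Room3A near/far, real Room1 near/far.
# Keys are (microphone setup, CT, MA); the second tuple item is the printed Avg.
ROWS = {
    ("single", "N", "N"): ([19.0, 25.6, 34.5, 69.8, 47.1, 78.3, 80.2, 76.6], 53.9),
    ("single", "Y", "N"): ([15.6, 20.7, 24.2, 45.3, 30.9, 57.5, 63.1, 62.4], 40.0),
    ("single", "N", "Y"): ([14.1, 17.9, 21.3, 45.1, 28.3, 59.5, 66.4, 65.9], 39.8),
    ("single", "Y", "Y"): ([14.5, 18.2, 21.2, 38.8, 26.8, 50.3, 57.3, 58.0], 35.6),
    ("two_mvdr", "N", "N"): ([18.0, 23.3, 27.7, 59.8, 40.1, 71.2, 75.1, 73.7], 48.6),
    ("two_mvdr", "Y", "N"): ([14.5, 19.0, 20.6, 38.8, 26.6, 51.0, 56.5, 58.6], 35.7),
    ("two_mvdr", "N", "Y"): ([13.5, 17.0, 18.9, 36.8, 24.5, 51.4, 58.8, 59.3], 35.0),
    ("two_mvdr", "Y", "Y"): ([13.7, 17.4, 18.3, 33.4, 23.3, 45.2, 51.2, 53.1], 31.9),
    ("eight_mvdr", "N", "N"): ([17.0, 21.3, 23.6, 40.3, 30.5, 53.2, 59.3, 58.1], 37.9),
    ("eight_mvdr", "Y", "N"): ([14.3, 17.2, 18.0, 27.9, 21.7, 36.2, 43.1, 46.4], 28.1),
    ("eight_mvdr", "N", "Y"): ([13.6, 16.4, 17.3, 26.6, 20.1, 35.6, 44.4, 46.1], 27.5),
    ("eight_mvdr", "Y", "Y"): ([13.7, 16.2, 15.8, 24.1, 19.5, 32.3, 38.1, 42.6], 25.3),
}


def mean_wer(values):
    """Unweighted mean over the eight conditions (assumed definition of Avg.)."""
    return sum(values) / len(values)


def relative_reduction(baseline, improved):
    """Relative WER reduction in percent, going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


# Compare the no-CT/no-MA baseline with the CT+MA system for each setup.
REDUCTIONS = {}
for setup in ("single", "two_mvdr", "eight_mvdr"):
    base = mean_wer(ROWS[(setup, "N", "N")][0])
    best = mean_wer(ROWS[(setup, "Y", "Y")][0])
    REDUCTIONS[setup] = relative_reduction(base, best)
    print(f"{setup}: {base:.1f} -> {best:.1f} "
          f"({REDUCTIONS[setup]:.1f}% relative WER reduction)")
```

Under this assumed definition, applying CT and MA together lowers the average WER by roughly one third relative to the baseline in each of the three microphone setups.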