Strategies for distant speech recognition in reverberant environments

EURASIP Journal on Advances in Signal Processing

Table 5 Results for the evaluation set using an acoustic model trained with the baseline multi-condition training data set

	Proc.	LM	Adap.	SimData							RealData
				Room1		Room2		Room3		Ave.	Room1		Ave.
				Near	Far	Near	Far	Near	Far	-	Near	Far	-
0 - a	Distant	TRI	-	6.1	6.7	7.4	11.2	7.8	12.2	8.6	29.2	27.8	28.5
b			✓	5.6	6.5	7.2	10.2	7.5	11.3	8.0	23.9	23.3	23.6
c		RNN	-	5.2	6.0	5.8	9.9	6.4	9.9	7.2	26.4	26.4	26.4
d			✓	4.7	5.8	6.0	9.5	6.7	10.1	7.1	21.6	22.3	22.0
I - a	WPE (1ch)	TRI	-	6.2	6.4	6.9	9.9	7.0	9.5	7.7	25.4	25.0	25.2
b			✓	5.4	5.8	6.2	9.1	6.4	8.6	6.9	20.1	20.2	20.1
c		RNN	-	5.1	5.5	5.2*	8.1	5.7	7.9	6.3	23.7	23.1	23.4
d			✓	4.7*	5.2*	5.4	8.0*	5.7*	7.7*	6.1*	18.6*	18.9*	18.7*
II - a	WPE (2ch)	TRI	-	6.3	6.3	6.7	8.5	6.7	8.4	7.1	23.5	22.9	23.2
b			✓	5.5	6.1	6.4	8.0	6.2	7.7	6.6	18.3	18.3	18.3
c		RNN	-	5.1	5.4	5.4	7.0	5.7	6.8	5.9	21.4	21.6	21.5
d			✓	4.7*	5.1	5.4	7.0	5.9	7.3	5.9	17.7	16.9	17.3
III - a	II + MVDR	TRI	-	6.3	6.6	6.5	7.3	6.1	7.4	6.7	21.1	20.5	20.8
b			✓	5.5	5.8	5.9	7.2	5.8	7.2	6.2	16.8	15.9	16.4
c		RNN	-	5.8	5.6	5.1	6.2	5.0	5.8	5.6	18.9	19.0	19.0
d			✓	4.8	5.0*	5.0*	6.3	5.3	6.3	5.4*	16.3	15.2	15.7
IV - a	III + DOL	TRI	-	6.6	6.8	6.4	7.1	5.8	7.0	6.6	20.3	18.7	19.5
b			✓	5.6	6.0	5.9	7.0	5.7	6.9	6.2	16.7	14.7*	15.7
c		RNN	-	5.7	5.5	5.0*	6.1*	4.8*	5.7*	5.5	17.9	18.2	18.1
d			✓	5.0	5.0*	5.0*	6.2	5.1	6.2	5.4*	15.3*	15.7	15.5*
V - a	WPE (8ch)	TRI	-	6.3	6.3	6.9	7.9	6.6	8.2	7.0	22.8	22.1	22.5
b			✓	5.6	5.9	6.4	7.3	6.0	7.6	6.5	17.4	17.1	17.2
c		RNN	-	5.3	5.5	5.1	6.1	5.4	6.9	5.7	21.9	20.5	21.2
d			✓	4.6*	4.9*	5.2	6.2	5.7	6.9	5.6	15.8	15.7	15.7
VI - a	V + MVDR	TRI	-	7.2	7.3	6.1	6.0	6.4	7.1	6.7	15.6	15.6	15.6
b			✓	5.5	6.1	5.6	5.5	5.9	6.5	5.8	12.5	12.7	12.6
c		RNN	-	6.2	6.3	4.8	5.3*	5.2*	5.7*	5.6	14.2	14.5	14.3
d			✓	5.1	5.1	4.6*	5.3*	5.3	5.8	5.2*	11.1	11.1*	11.1*
VII - a	VI + DOL	TRI	-	7.7	7.6	6.7	6.7	7.1	7.2	7.2	14.5	15.7	15.1
b			✓	5.9	6.3	5.8	5.8	6.0	6.6	6.0	12.2	13.9	13.0
c		RNN	-	6.7	6.1	5.6	5.6	5.9	6.4	6.1	12.2	13.9	13.0
d			✓	5.1	5.2	4.7	5.3*	5.4	5.8	5.3	9.8*	12.4	11.1*

*Best performance in each column within the 1ch, 2ch, and 8ch system groups, respectively
The results are presented for the different components of the SE front-end (Proc.) and for different configurations of the ASR back-end: the language model (LM) used, either trigram (TRI) or RNNLM (RNN), and decoding with (✓) or without (-) adaptation (Adap.)
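
To make the size of the gains in Table 5 easier to read, the following minimal sketch computes a relative error reduction between two rows and recomputes one "Ave." entry as a plain mean. It assumes the entries are error rates in percent where lower is better (this excerpt does not name the metric). The names `relative_reduction`, `baseline`, `best_8ch`, and `sim_row_0a` are hypothetical and chosen only for illustration; the numbers are copied from the table.

```python
# Values copied from Table 5, RealData "Ave." column (assumed: lower is better).
baseline = 28.5  # row 0-a: distant microphone, TRI LM, no adaptation
best_8ch = 11.1  # row VI-d: WPE (8ch) + MVDR, RNN LM, with adaptation


def relative_reduction(reference: float, system: float) -> float:
    """Relative error reduction of `system` over `reference`, in percent."""
    return 100.0 * (reference - system) / reference


# Six SimData entries of row 0-a (Room1-3, Near and Far).
sim_row_0a = [6.1, 6.7, 7.4, 11.2, 7.8, 12.2]
# Plain (unweighted) mean; the published "Ave." may have been computed from
# unrounded or differently weighted figures, so the last digit can differ.
sim_mean_0a = sum(sim_row_0a) / len(sim_row_0a)

if __name__ == "__main__":
    print(f"Relative reduction, 0-a vs VI-d: {relative_reduction(baseline, best_8ch):.1f}%")
    print(f"Plain mean of SimData entries, row 0-a: {sim_mean_0a:.2f}")
```

For these two rows the relative reduction on RealData comes out at roughly 61%, and the plain mean for row 0-a agrees with the tabulated 8.6 after rounding to one decimal.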