Table 5 Results for the evaluation set using an acoustic model trained with the baseline multi-condition training data set

From: Strategies for distant speech recognition in reverberant environments

ID     Proc.      LM   Adap.  SimData                                          RealData
                              Room1         Room2         Room3         Ave.   Room1         Ave.
                              Near   Far    Near   Far    Near   Far           Near   Far
0-a    Distant    TRI  -      6.1    6.7    7.4    11.2   7.8    12.2   8.6    29.2   27.8   28.5
0-b    Distant    TRI  ✓      5.6    6.5    7.2    10.2   7.5    11.3   8.0    23.9   23.3   23.6
0-c    Distant    RNN  -      5.2    6.0    5.8    9.9    6.4    9.9    7.2    26.4   26.4   26.4
0-d    Distant    RNN  ✓      4.7    5.8    6.0    9.5    6.7    10.1   7.1    21.6   22.3   22.0
I-a    WPE (1ch)  TRI  -      6.2    6.4    6.9    9.9    7.0    9.5    7.7    25.4   25.0   25.2
I-b    WPE (1ch)  TRI  ✓      5.4    5.8    6.2    9.1    6.4    8.6    6.9    20.1   20.2   20.1
I-c    WPE (1ch)  RNN  -      5.1    5.5    5.2*   8.1    5.7    7.9    6.3    23.7   23.1   23.4
I-d    WPE (1ch)  RNN  ✓      4.7*   5.2*   5.4    8.0*   5.7*   7.7*   6.1*   18.6*  18.9*  18.7*
II-a   WPE (2ch)  TRI  -      6.3    6.3    6.7    8.5    6.7    8.4    7.1    23.5   22.9   23.2
II-b   WPE (2ch)  TRI  ✓      5.5    6.1    6.4    8.0    6.2    7.7    6.6    18.3   18.3   18.3
II-c   WPE (2ch)  RNN  -      5.1    5.4    5.4    7.0    5.7    6.8    5.9    21.4   21.6   21.5
II-d   WPE (2ch)  RNN  ✓      4.7*   5.1    5.4    7.0    5.9    7.3    5.9    17.7   16.9   17.3
III-a  II + MVDR  TRI  -      6.3    6.6    6.5    7.3    6.1    7.4    6.7    21.1   20.5   20.8
III-b  II + MVDR  TRI  ✓      5.5    5.8    5.9    7.2    5.8    7.2    6.2    16.8   15.9   16.4
III-c  II + MVDR  RNN  -      5.8    5.6    5.1    6.2    5.0    5.8    5.6    18.9   19.0   19.0
III-d  II + MVDR  RNN  ✓      4.8    5.0*   5.0*   6.3    5.3    6.3    5.4*   16.3   15.2   15.7
IV-a   III + DOL  TRI  -      6.6    6.8    6.4    7.1    5.8    7.0    6.6    20.3   18.7   19.5
IV-b   III + DOL  TRI  ✓      5.6    6.0    5.9    7.0    5.7    6.9    6.2    16.7   14.7*  15.7
IV-c   III + DOL  RNN  -      5.7    5.5    5.0*   6.1*   4.8*   5.7*   5.5    17.9   18.2   18.1
IV-d   III + DOL  RNN  ✓      5.0    5.0*   5.0*   6.2    5.1    6.2    5.4*   15.3*  15.7   15.5*
V-a    WPE (8ch)  TRI  -      6.3    6.3    6.9    7.9    6.6    8.2    7.0    22.8   22.1   22.5
V-b    WPE (8ch)  TRI  ✓      5.6    5.9    6.4    7.3    6.0    7.6    6.5    17.4   17.1   17.2
V-c    WPE (8ch)  RNN  -      5.3    5.5    5.1    6.1    5.4    6.9    5.7    21.9   20.5   21.2
V-d    WPE (8ch)  RNN  ✓      4.6*   4.9*   5.2    6.2    5.7    6.9    5.6    15.8   15.7   15.7
VI-a   V + MVDR   TRI  -      7.2    7.3    6.1    6.0    6.4    7.1    6.7    15.6   15.6   15.6
VI-b   V + MVDR   TRI  ✓      5.5    6.1    5.6    5.5    5.9    6.5    5.8    12.5   12.7   12.6
VI-c   V + MVDR   RNN  -      6.2    6.3    4.8    5.3*   5.2*   5.7*   5.6    14.2   14.5   14.3
VI-d   V + MVDR   RNN  ✓      5.1    5.1    4.6*   5.3*   5.3    5.8    5.2*   11.1   11.1*  11.1*
VII-a  VI + DOL   TRI  -      7.7    7.6    6.7    6.7    7.1    7.2    7.2    14.5   15.7   15.1
VII-b  VI + DOL   TRI  ✓      5.9    6.3    5.8    5.8    6.0    6.6    6.0    12.2   13.9   13.0
VII-c  VI + DOL   RNN  -      6.7    6.1    5.6    5.6    5.9    6.4    6.1    12.2   13.9   13.0
VII-d  VI + DOL   RNN  ✓      5.1    5.2    4.7    5.3*   5.4    5.8    5.3    9.8*   12.4   11.1*
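  1. An asterisk (*) marks the best performance in each column within the 1ch, 2ch, and 8ch system groups
  2. The results are presented for the different components of the speech enhancement (SE) front-end and for different configurations of the automatic speech recognition (ASR) back-end: the language model (LM) used, either trigram (TRI) or RNNLM (RNN), and whether adaptation is applied (✓, rows b and d) or not (-, rows a and c)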
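
The "Ave." columns and the gap between two systems can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the averages are unweighted means of the per-condition scores and that lower scores are better (the starred entries are the lowest in their groups). The helper names `average_score` and `relative_reduction` are hypothetical and not from the paper. The published averages were presumably computed from unrounded scores, so a recomputed value may differ from the table by 0.1.

```python
from statistics import mean


def average_score(scores):
    """Unweighted mean of per-condition scores, rounded to one decimal.

    Assumption: the table's "Ave." column is a plain mean over conditions.
    """
    return round(mean(scores), 1)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the `baseline` score."""
    return (baseline - improved) / baseline


# First row of the table: Distant, TRI, no adaptation.
sim_distant = [6.1, 6.7, 7.4, 11.2, 7.8, 12.2]  # SimData Room1-3, Near/Far
real_distant = [29.2, 27.8]                      # RealData Room1, Near/Far

# Row d of the V + MVDR system: RNN with adaptation.
sim_mvdr8 = [5.1, 5.1, 4.6, 5.3, 5.3, 5.8]
real_mvdr8 = [11.1, 11.1]

print("SimData  Ave.:", average_score(sim_distant), "->", average_score(sim_mvdr8))
print("RealData Ave.:", average_score(real_distant), "->", average_score(real_mvdr8))

# Relative reduction on the RealData averages listed in the table.
reduction = relative_reduction(28.5, 11.1)
print(f"RealData relative reduction: {reduction:.1%}")
```

For these two rows the recomputed means match the listed "Ave." entries (8.6 and 28.5 for the first row, 5.2 and 11.1 for the second).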