Table 6 Results for the evaluation set using an acoustic model trained with extended training data

From: Strategies for distant speech recognition in reverberant environments

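| System | Proc. | LM | Adap. | SimData Room1 Near | SimData Room1 Far | SimData Room2 Near | SimData Room2 Far | SimData Room3 Near | SimData Room3 Far | SimData Ave. | RealData Room1 Near | RealData Room1 Far | RealData Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Clean/Headset mic | RNN | | - | - | - | - | - | - | 3.5 | - | - | 6.1 |
| | Lapel mic | RNN | | - | - | - | - | - | - | - | - | - | 7.3 |
| 0-a | Distant | TRI | - | 5.0 | 5.7 | 6.3 | 11.1 | 7.2 | 11.2 | 7.8 | 25.8 | 26.5 | 26.1 |
| 0-b | Distant | TRI | ✓ | 4.5 | 5.4 | 6.1 | 9.9 | 7.0 | 10.2 | 7.2 | 19.7 | 22.1 | 20.9 |
| 0-c | Distant | RNN | - | 4.2 | 5.0 | 5.3 | 9.2 | 5.7 | 9.7 | 6.5 | 23.0 | 24.4 | 23.7 |
| 0-d | Distant | RNN | ✓ | 3.8 | 4.9 | 4.9 | 8.1 | 5.6 | 8.6 | 6.0 | 18.5 | 19.9 | 19.2 |
| I-a | WPE (1ch) | TRI | - | 4.8 | 5.4 | 6.2 | 8.8 | 6.1 | 8.6 | 6.6 | 20.3 | 20.3 | 20.3 |
| I-b | WPE (1ch) | TRI | ✓ | 4.7 | 5.0 | 5.9 | 8.2 | 5.9 | 8.1 | 6.3 | 16.8 | 17.3 | 17.0 |
| I-c | WPE (1ch) | RNN | - | 3.8 | 4.3 | 4.7 | 7.6 | 5.3 | 7.4 | 5.5 | 19.2 | 18.7 | 19.0 |
| I-d | WPE (1ch) | RNN | ✓ | 3.6* | 3.9* | 4.4* | 6.6* | 4.8* | 6.8* | 5.0* | 15.2* | 16.7* | 15.9* |
| II-a | WPE (2ch) | TRI | - | 4.9 | 5.1 | 6.0 | 7.2 | 6.0 | 7.3 | 6.1 | 18.0 | 18.1 | 18.1 |
| II-b | WPE (2ch) | TRI | ✓ | 4.6 | 4.9 | 5.8 | 6.8 | 5.8 | 6.7 | 5.8 | 14.5 | 16.0 | 15.2 |
| II-c | WPE (2ch) | RNN | - | 4.0 | 4.3 | 4.9 | 6.1 | 5.0 | 5.8 | 5.0 | 16.5 | 16.5 | 16.5 |
| II-d | WPE (2ch) | RNN | ✓ | 3.7* | 4.0* | 4.4 | 5.7 | 4.7 | 5.4 | 4.6 | 13.4 | 13.1 | 13.2 |
| III-a | II + MVDR | TRI | - | 4.9 | 5.1 | 5.8 | 6.8 | 5.7 | 6.7 | 5.8 | 15.6 | 15.9 | 15.8 |
| III-b | II + MVDR | TRI | ✓ | 4.6 | 4.9 | 5.4 | 6.2 | 5.6 | 6.6 | 5.5 | 12.9 | 14.2 | 13.5 |
| III-c | II + MVDR | RNN | - | 4.2 | 4.4 | 4.2 | 5.5 | 4.9 | 5.4 | 4.8 | 14.3 | 14.8 | 14.6 |
| III-d | II + MVDR | RNN | ✓ | 3.9 | 4.2 | 4.1* | 5.0* | 4.5 | 5.0* | 4.4* | 11.6 | 12.8 | 12.2 |
| IV-a | III + DOL | TRI | - | 4.8 | 5.1 | 5.6 | 6.5 | 5.8 | 6.4 | 5.7 | 15.8 | 15.9 | 15.8 |
| IV-b | III + DOL | TRI | ✓ | 4.7 | 5.0 | 5.4 | 6.1 | 5.6 | 6.2 | 5.5 | 12.8 | 14.0 | 13.4 |
| IV-c | III + DOL | RNN | - | 4.1 | 4.7 | 4.2 | 5.2 | 4.9 | 5.4 | 4.7 | 14.4 | 14.6 | 14.5 |
| IV-d | III + DOL | RNN | ✓ | 3.7 | 4.1 | 4.3 | 5.0* | 4.4* | 5.2 | 4.4* | 11.1* | 12.7* | 11.9* |
| V-a | WPE (8ch) | TRI | - | 4.6 | 5.1 | 5.8 | 6.5 | 5.8 | 6.8 | 5.8 | 16.7 | 17.0 | 16.9 |
| V-b | WPE (8ch) | TRI | ✓ | 4.42 | 4.9 | 5.8 | 6.2 | 5.5 | 6.5 | 5.5 | 13.5 | 13.9 | 13.7 |
| V-c | WPE (8ch) | RNN | - | 3.8 | 4.2 | 4.6 | 5.5 | 4.8 | 5.4 | 4.7 | 16.0 | 15.7 | 15.8 |
| V-d | WPE (8ch) | RNN | ✓ | 3.5* | 3.9* | 4.1 | 5.0 | 4.3 | 5.3 | 4.3 | 12.7 | 13.1 | 12.9 |
| VI-a | V + MVDR | TRI | - | 4.8 | 5.2 | 5.2 | 5.5 | 5.5 | 5.9 | 5.3 | 11.9 | 12.4 | 12.2 |
| VI-b | V + MVDR | TRI | ✓ | 4.6 | 4.9 | 5.0 | 5.2 | 5.5 | 5.8 | 5.2 | 10.0 | 10.2 | 10.1 |
| VI-c | V + MVDR | RNN | - | 4.0 | 4.2 | 3.8 | 4.3 | 4.4 | 4.8 | 4.2 | 10.4 | 11.9 | 11.1 |
| VI-d | V + MVDR | RNN | ✓ | 3.7 | 3.9* | 3.5* | 4.2* | 4.2* | 4.9 | 4.1* | 9.0 | 9.6 | 9.3 |
| VII-a | VI + DOL | TRI | - | 5.0 | 5.3 | 5.2 | 5.5 | 5.7 | 5.8 | 5.4 | 11.1 | 12.3 | 11.7 |
| VII-b | VI + DOL | TRI | ✓ | 4.8 | 4.8 | 5.0 | 5.3 | 5.6 | 5.6 | 5.2 | 9.7 | 9.9 | 9.8 |
| VII-c | VI + DOL | RNN | - | 4.1 | 4.2 | 3.9 | 4.3 | 4.3 | 4.8 | 4.3 | 10.0 | 11.4 | 10.7 |
| VII-d | VI + DOL | RNN | ✓ | 3.9 | 4.0 | 3.7 | 4.2* | 4.3 | 4.6* | 4.1* | 8.9* | 9.3* | 9.1* |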
  1. *Best performance in each column within the 1ch, 2ch and 8ch system groups
  2. Results are given for each component of the SE front-end and for each configuration of the ASR back-end: the language model (LM), either trigram (TRI) or RNNLM (RNN), and with (✓) or without (-) adaptation