
Table 5 Results for the evaluation set using an acoustic model trained with the baseline multi-condition training data set

From: Strategies for distant speech recognition in reverberant environments

 

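| Proc. | | LM | Adap. | SimData Room1 Near | SimData Room1 Far | SimData Room2 Near | SimData Room2 Far | SimData Room3 Near | SimData Room3 Far | SimData Ave. | RealData Room1 Near | RealData Room1 Far | RealData Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 - a | Distant | TRI | - | 6.1 | 6.7 | 7.4 | 11.2 | 7.8 | 12.2 | 8.6 | 29.2 | 27.8 | 28.5 |
| b | | | ✓ | 5.6 | 6.5 | 7.2 | 10.2 | 7.5 | 11.3 | 8.0 | 23.9 | 23.3 | 23.6 |
| c | | RNN | - | 5.2 | 6.0 | 5.8 | 9.9 | 6.4 | 9.9 | 7.2 | 26.4 | 26.4 | 26.4 |
| d | | | ✓ | 4.7 | 5.8 | 6.0 | 9.5 | 6.7 | 10.1 | 7.1 | 21.6 | 22.3 | 22.0 |
| I - a | WPE (1ch) | TRI | - | 6.2 | 6.4 | 6.9 | 9.9 | 7.0 | 9.5 | 7.7 | 25.4 | 25.0 | 25.2 |
| b | | | ✓ | 5.4 | 5.8 | 6.2 | 9.1 | 6.4 | 8.6 | 6.9 | 20.1 | 20.2 | 20.1 |
| c | | RNN | - | 5.1 | 5.5 | 5.2* | 8.1 | 5.7 | 7.9 | 6.3 | 23.7 | 23.1 | 23.4 |
| d | | | ✓ | 4.7* | 5.2* | 5.4 | 8.0* | 5.7* | 7.7* | 6.1* | 18.6* | 18.9* | 18.7* |
| II - a | WPE (2ch) | TRI | - | 6.3 | 6.3 | 6.7 | 8.5 | 6.7 | 8.4 | 7.1 | 23.5 | 22.9 | 23.2 |
| b | | | ✓ | 5.5 | 6.1 | 6.4 | 8.0 | 6.2 | 7.7 | 6.6 | 18.3 | 18.3 | 18.3 |
| c | | RNN | - | 5.1 | 5.4 | 5.4 | 7.0 | 5.7 | 6.8 | 5.9 | 21.4 | 21.6 | 21.5 |
| d | | | ✓ | 4.7* | 5.1 | 5.4 | 7.0 | 5.9 | 7.3 | 5.9 | 17.7 | 16.9 | 17.3 |
| III - a | II + MVDR | TRI | - | 6.3 | 6.6 | 6.5 | 7.3 | 6.1 | 7.4 | 6.7 | 21.1 | 20.5 | 20.8 |
| b | | | ✓ | 5.5 | 5.8 | 5.9 | 7.2 | 5.8 | 7.2 | 6.2 | 16.8 | 15.9 | 16.4 |
| c | | RNN | - | 5.8 | 5.6 | 5.1 | 6.2 | 5.0 | 5.8 | 5.6 | 18.9 | 19.0 | 19.0 |
| d | | | ✓ | 4.8 | 5.0* | 5.0* | 6.3 | 5.3 | 6.3 | 5.4* | 16.3 | 15.2 | 15.7 |
| IV - a | III + DOL | TRI | - | 6.6 | 6.8 | 6.4 | 7.1 | 5.8 | 7.0 | 6.6 | 20.3 | 18.7 | 19.5 |
| b | | | ✓ | 5.6 | 6.0 | 5.9 | 7.0 | 5.7 | 6.9 | 6.2 | 16.7 | 14.7* | 15.7 |
| c | | RNN | - | 5.7 | 5.5 | 5.0* | 6.1* | 4.8* | 5.7* | 5.5 | 17.9 | 18.2 | 18.1 |
| d | | | ✓ | 5.0 | 5.0* | 5.0* | 6.2 | 5.1 | 6.2 | 5.4* | 15.3* | 15.7 | 15.5* |
| V - a | WPE (8ch) | TRI | - | 6.3 | 6.3 | 6.9 | 7.9 | 6.6 | 8.2 | 7.0 | 22.8 | 22.1 | 22.5 |
| b | | | ✓ | 5.6 | 5.9 | 6.4 | 7.3 | 6.0 | 7.6 | 6.5 | 17.4 | 17.1 | 17.2 |
| c | | RNN | - | 5.3 | 5.5 | 5.1 | 6.1 | 5.4 | 6.9 | 5.7 | 21.9 | 20.5 | 21.2 |
| d | | | ✓ | 4.6* | 4.9* | 5.2 | 6.2 | 5.7 | 6.9 | 5.6 | 15.8 | 15.7 | 15.7 |
| VI - a | V + MVDR | TRI | - | 7.2 | 7.3 | 6.1 | 6.0 | 6.4 | 7.1 | 6.7 | 15.6 | 15.6 | 15.6 |
| b | | | ✓ | 5.5 | 6.1 | 5.6 | 5.5 | 5.9 | 6.5 | 5.8 | 12.5 | 12.7 | 12.6 |
| c | | RNN | - | 6.2 | 6.3 | 4.8 | 5.3* | 5.2* | 5.7* | 5.6 | 14.2 | 14.5 | 14.3 |
| d | | | ✓ | 5.1 | 5.1 | 4.6* | 5.3* | 5.3 | 5.8 | 5.2* | 11.1 | 11.1* | 11.1* |
| VII - a | VI + DOL | TRI | - | 7.7 | 7.6 | 6.7 | 6.7 | 7.1 | 7.2 | 7.2 | 14.5 | 15.7 | 15.1 |
| b | | | ✓ | 5.9 | 6.3 | 5.8 | 5.8 | 6.0 | 6.6 | 6.0 | 12.2 | 13.9 | 13.0 |
| c | | RNN | - | 6.7 | 6.1 | 5.6 | 5.6 | 5.9 | 6.4 | 6.1 | 12.2 | 13.9 | 13.0 |
| d | | | ✓ | 5.1 | 5.2 | 4.7 | 5.3* | 5.4 | 5.8 | 5.3 | 9.8* | 12.4 | 11.1* |
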
  1. *Best performance for 1ch, 2ch, and 8ch
  2. Results are shown for each component of the SE front-end and for different configurations of the ASR back-end: the language model (LM), either trigram (TRI) or RNNLM (RNN), and with (✓) or without (-) adaptation
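
As a rough aid for reading the table, the sketch below recomputes the Ave. columns and a relative reduction for two rows (0 - a and VII - d), with values copied from the table. It rests on two assumptions not stated in the table: that each Ave. entry is an unweighted mean of the Near/Far conditions (so agreement is only checked to within rounding), and that lower scores are better, which the asterisks on the lowest values suggest. The names `ROWS`, `mean_matches`, and `relative_reduction` are illustrative only.

```python
# Values copied from Table 5 (rows "0 - a" and "VII - d").
ROWS = {
    "0-a": {"sim": [6.1, 6.7, 7.4, 11.2, 7.8, 12.2], "sim_ave": 8.6,
            "real": [29.2, 27.8], "real_ave": 28.5},
    "VII-d": {"sim": [5.1, 5.2, 4.7, 5.3, 5.4, 5.8], "sim_ave": 5.3,
              "real": [9.8, 12.4], "real_ave": 11.1},
}


def mean_matches(values, reported, tol=0.1):
    """Return True if the unweighted mean of `values` agrees with `reported`.

    Assumption: the Ave. columns are plain means of the listed conditions.
    The tolerance absorbs the one-decimal rounding of the table entries.
    """
    return abs(sum(values) / len(values) - reported) <= tol + 1e-9


def relative_reduction(baseline, system):
    """Relative reduction in percent when moving from `baseline` to `system`."""
    return 100.0 * (baseline - system) / baseline


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(name,
              "SimData ave consistent:", mean_matches(row["sim"], row["sim_ave"]),
              "RealData ave consistent:", mean_matches(row["real"], row["real_ave"]))
    for key in ("sim_ave", "real_ave"):
        reduction = relative_reduction(ROWS["0-a"][key], ROWS["VII-d"][key])
        print(f"{key}: relative reduction from 0-a to VII-d = {reduction:.1f}%")
```

The comparison of 0 - a with VII - d is only one possible pairing; any two rows can be substituted in `ROWS` to compare other front-end or back-end configurations.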