Table 5 ASR performance (WER) using clean condition training data on the evaluation data

From: Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

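| CT | MA | Simulated Room1A, Near | Simulated Room1A, Far | Simulated Room2A, Near | Simulated Room2A, Far | Simulated Room3A, Near | Simulated Room3A, Far | Real Room1, Near | Real Room1, Far | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Single microphone | | | | | | | | | | |
| N | N | 19.0 | 25.6 | 34.5 | 69.8 | 47.1 | 78.3 | 80.2 | 76.6 | 53.9 |
| Y | N | 15.6 | 20.7 | 24.2 | 45.3 | 30.9 | 57.5 | 63.1 | 62.4 | 40.0 |
| N | Y | 14.1 | 17.9 | 21.3 | 45.1 | 28.3 | 59.5 | 66.4 | 65.9 | 39.8 |
| Y | Y | 14.5 | 18.2 | 21.2 | 38.8 | 26.8 | 50.3 | 57.3 | 58.0 | 35.6 |
| Two microphones, with MVDR | | | | | | | | | | |
| N | N | 18.0 | 23.3 | 27.7 | 59.8 | 40.1 | 71.2 | 75.1 | 73.7 | 48.6 |
| Y | N | 14.5 | 19.0 | 20.6 | 38.8 | 26.6 | 51.0 | 56.5 | 58.6 | 35.7 |
| N | Y | 13.5 | 17.0 | 18.9 | 36.8 | 24.5 | 51.4 | 58.8 | 59.3 | 35.0 |
| Y | Y | 13.7 | 17.4 | 18.3 | 33.4 | 23.3 | 45.2 | 51.2 | 53.1 | 31.9 |
| Eight microphones, with MVDR | | | | | | | | | | |
| N | N | 17.0 | 21.3 | 23.6 | 40.3 | 30.5 | 53.2 | 59.3 | 58.1 | 37.9 |
| Y | N | 14.3 | 17.2 | 18.0 | 27.9 | 21.7 | 36.2 | 43.1 | 46.4 | 28.1 |
| N | Y | 13.6 | 16.4 | 17.3 | 26.6 | 20.1 | 35.6 | 44.4 | 46.1 | 27.5 |
| Y | Y | 13.7 | 16.2 | 15.8 | 24.1 | 19.5 | 32.3 | 38.1 | 42.6 | 25.3 |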

  1. CT stands for cross transform; MA refers to the 256-class-based CMLLR model adaptation.
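
The Avg. column of Table 5 can be cross-checked with a short script. The sketch below uses the single-microphone rows only and rests on two assumptions that the table itself does not state: that Avg. is the unweighted mean of the eight Near/Far conditions, and that the values are percentages. The names `single_mic`, `mean_wer` and `relative_reduction` are illustrative and do not come from the paper.

```python
# Single-microphone rows of Table 5, keyed by (CT, MA).
# Each value is (eight per-condition WERs, reported Avg.).
single_mic = {
    ("N", "N"): ([19.0, 25.6, 34.5, 69.8, 47.1, 78.3, 80.2, 76.6], 53.9),
    ("Y", "N"): ([15.6, 20.7, 24.2, 45.3, 30.9, 57.5, 63.1, 62.4], 40.0),
    ("N", "Y"): ([14.1, 17.9, 21.3, 45.1, 28.3, 59.5, 66.4, 65.9], 39.8),
    ("Y", "Y"): ([14.5, 18.2, 21.2, 38.8, 26.8, 50.3, 57.3, 58.0], 35.6),
}


def mean_wer(wers):
    """Unweighted mean over the listed conditions (assumed averaging rule)."""
    return sum(wers) / len(wers)


def relative_reduction(baseline, improved):
    """Relative WER reduction, in percent of the baseline WER."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    for (ct, ma), (wers, reported) in single_mic.items():
        # The table rounds to one decimal, so compare with a small tolerance.
        computed = mean_wer(wers)
        status = "matches" if abs(computed - reported) < 0.1 else "differs from"
        print(f"CT={ct} MA={ma}: computed {computed:.2f} {status} reported {reported}")

    baseline = single_mic[("N", "N")][1]
    combined = single_mic[("Y", "Y")][1]
    print(f"Relative reduction, CT+MA vs. none: "
          f"{relative_reduction(baseline, combined):.1f}%")
```

Under these assumptions the recomputed means agree with the reported Avg. values to within rounding. The same two helpers apply unchanged to the two- and eight-microphone rows.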