
Table 2 Performance of speech enhancement methods using eight channel inputs on simulated and real room recordings. The results are averaged over near and far test cases

From: Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

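| Name | Target/train | Estimation | CMN | Simulated CD (dB) mean | Simulated CD (dB) med. | Simulated SRMR mean | Simulated LLR mean | Simulated LLR med. | Simulated SNR (dB) mean | Simulated SNR (dB) med. | Real SRMR mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MVDR | - | - | No | 3.64 | 3.28 | 4.85 | 0.48 | 0.43 | 5.31 | 7.76 | 4.12 |
| DS | - | - | No | 3.11 | 2.76 | 4.34 | 0.410 | 0.360 | 6.60 | 9.24 | 3.84 |
| DS+SS | - | - | No | 3.00 | 2.67 | 4.56 | 0.410 | 0.360 | 7.16 | 9.68 | 4.62 |
| Effect of beamforming | | | | | | | | | | | |
| MVDR+DNN4 | 257/random | - | Vis/Tgt | 2.28 | 2.07 | 5.88 | 0.470 | 0.440 | 8.44 | 8.88 | 4.51 |
| DS+DNN4 | 257/random | - | Vis/Tgt | 2.01 | 1.85 | 5.92 | 0.467 | 0.440 | 8.56 | 8.88 | 4.40 |
| Effect of CMN | | | | | | | | | | | |
| DS+DNN5 | 257/random | - | No | 2.15 | 1.96 | 4.80 | 0.205 | 0.155 | 12.07 | 12.55 | 4.95 |
| DS+DNN6 | 257/random | - | Vis | 2.18 | 2.00 | 5.27 | 0.235 | 0.198 | 11.11 | 11.49 | 4.24 |
| DS+DNN4a | 257/random | - | Vis/Tgt2 | 2.02 | 1.86 | 5.16 | 0.278 | 0.237 | 10.93 | 11.55 | 3.76 |
| Effect of dynamic features | | | | | | | | | | | |
| DS+DNN7 | 3x257/random | Use static | No | 2.15 | 1.96 | 4.88 | 0.205 | 0.158 | 12.04 | 12.59 | 4.95 |
| DS+DNN7LS | 3x257/random | LS (9) | No | 2.07 | 1.90 | 4.83 | 0.195 | 0.150 | 12.42 | 12.99 | 4.78 |
| DS+DNN8 | 257-3x257/seq. | Use static | No | 2.04 | 1.86 | 4.61 | 0.193 | 0.147 | 12.64 | 13.22 | 4.62 |
| Effect of clean phase | | | | | | | | | | | |
| DS+DNN8c | 257-3x257/seq. | Use static | No | 1.84 | 1.66 | 4.90 | 0.165 | 0.123 | 13.37 | 13.88 | - |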

  1. “mean” represents the mean value of the scores of utterances, while “med.” represents the median of the scores. “random” means that the frames of each minibatch of DNN training are randomly selected from the whole training corpus, while “seq.” refers to the case in which each utterance is used as a minibatch. An output size of “3x257” means that the static, delta, and acceleration spectra are all predicted. The target size of “257-3x257” refers to the sequential training in Section 4.4.2. All DNNs use 15 frames of input and 3072 nodes per hidden layer. DNN8c is the same as DNN8 except that clean phase is used. Best results of each metric (not including DNN8c) are shown in italics
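The difference between the “random” and “seq.” minibatch schemes described in the note above can be illustrated with a minimal sketch. This is not the authors' training code: the function names `random_minibatches` and `sequential_minibatches`, the batch size, the seed, and the toy corpus are all hypothetical, and frames are stand-in tuples rather than spectral vectors.

```python
import random


def random_minibatches(utterances, batch_size, seed=0):
    """'random' scheme: pool frames from the whole corpus, shuffle, then cut into batches."""
    frames = [frame for utt in utterances for frame in utt]
    rng = random.Random(seed)  # hypothetical fixed seed, only for reproducibility
    rng.shuffle(frames)
    return [frames[i:i + batch_size] for i in range(0, len(frames), batch_size)]


def sequential_minibatches(utterances):
    """'seq.' scheme: each utterance, frames kept in time order, forms one minibatch."""
    return [list(utt) for utt in utterances]


# Toy corpus (assumption): three utterances of 4, 6 and 5 frames,
# each frame labelled (utterance id, frame index).
corpus = [[(u, t) for t in range(n)] for u, n in enumerate([4, 6, 5])]

rand_batches = random_minibatches(corpus, batch_size=5)
seq_batches = sequential_minibatches(corpus)
```

In the sequential scheme the frames of a minibatch stay contiguous in time, which is what makes utterance-level quantities such as delta and acceleration spectra computable within a batch; the random scheme mixes frames from different utterances.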