
Table 1 Performance of speech enhancement methods using single-channel input on simulated and real room recordings. The results are averaged over near and far test cases

From: Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

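The CD, first SRMR, LLR, and SNR columns are for simulated rooms; the last SRMR column is for real recordings.

| Name | Architecture | CMN | CD mean | CD med. | SRMR mean | LLR mean | LLR med. | SNR mean | SNR med. | SRMR mean (real) |
|---|---|---|---|---|---|---|---|---|---|---|
| None | - | No | 3.97 | 3.69 | 3.68 | 0.58 | 0.51 | 3.62 | 5.39 | 3.18 |
| SS | - | No | 3.82 | 3.51 | 4.0 | 0.56 | 0.51 | 4.75 | 6.89 | 3.94 |
| Effect of input context size and hidden layer size | | | | | | | | | | |
| DNN1 | 11x257-2048-2048-2048-257 | Yes | 2.64 | 2.41 | 5.78 | 0.52 | 0.48 | 7.19 | 8.09 | 4.54 |
| DNN2 | 15x257-3072-3072-3072-257 | Yes | 2.53 | 2.31 | 5.80 | 0.51 | 0.47 | 7.54 | 8.31 | 4.36 |
| DNN3 | 19x257-3072-3072-3072-257 | Yes | 2.50 | 2.28 | 5.77 | 0.50 | 0.47 | 7.55 | 8.35 | 4.36 |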

  1. “mean” represents the mean value of the scores of utterances, while “med.” represents the median of the scores. “CMN” indicates whether utterance-wise mean normalization is applied during both DNN training and testing. A DNN architecture of “11x257-2048-2048-2048-257” means that 11 frames of 257D log-magnitude spectrum are used as input, followed by three hidden layers each with 2048 nodes, and the output layer predicts the 257D log-magnitude spectrum of clean speech
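To make the architecture notation in the footnote concrete, the following minimal sketch expands a string such as “11x257-2048-2048-2048-257” into per-layer sizes. The function names `parse_architecture` and `count_parameters` are hypothetical and used here only for illustration. The parameter count additionally assumes plain fully connected layers with one bias per unit, which the footnote does not state.

```python
def parse_architecture(spec):
    """Expand a spec like '11x257-2048-2048-2048-257' into layer sizes.

    The first field is '<context frames>x<feature dimension>'; the remaining
    fields are the hidden layer sizes followed by the output dimension.
    """
    parts = spec.split("-")
    frames, feat_dim = (int(x) for x in parts[0].split("x"))
    input_dim = frames * feat_dim  # stacked context frames form one input vector
    hidden = [int(p) for p in parts[1:-1]]
    output_dim = int(parts[-1])
    return {
        "context_frames": frames,
        "feature_dim": feat_dim,
        "layer_sizes": [input_dim, *hidden, output_dim],
    }


def count_parameters(layer_sizes):
    """Weights plus biases, ASSUMING dense layers with a bias per unit."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


if __name__ == "__main__":
    for name, spec in [("DNN1", "11x257-2048-2048-2048-257"),
                       ("DNN2", "15x257-3072-3072-3072-257"),
                       ("DNN3", "19x257-3072-3072-3072-257")]:
        arch = parse_architecture(spec)
        print(name, arch["layer_sizes"], count_parameters(arch["layer_sizes"]))
```

Under this reading, DNN1 takes a 2827-dimensional input (11 frames of 257 values each), while DNN2 and DNN3 take 3855 and 4883 inputs respectively. The output stays at 257 dimensions in all three cases.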