
Multi-resolution auditory cepstral coefficient and adaptive mask for speech enhancement with deep neural network

Abstract

The performance of existing speech enhancement algorithms is not ideal in low signal-to-noise ratio (SNR), non-stationary noise environments. In order to resolve this problem, a novel speech enhancement algorithm based on a multi-resolution feature and an adaptive mask with deep learning is presented in this paper. First, we construct a new feature called the multi-resolution auditory cepstral coefficient (MRACC). This feature, extracted from four cochleagrams of different resolutions, captures both local information and spectrotemporal context while reducing algorithm complexity. Second, an adaptive mask (AM) that can track noise changes for speech enhancement is put forward. The AM flexibly combines the advantages of the ideal binary mask (IBM) and the ideal ratio mask (IRM) as the SNR changes. Third, a deep neural network (DNN) is used as a nonlinear function to estimate the adaptive mask, with the MRACC and its first and second derivatives as the input. Finally, the estimated AM is used to weight the noisy speech to obtain the enhanced speech. Experimental results show that the proposed algorithm not only further improves speech quality and intelligibility but also suppresses more noise than the contrast algorithms, while having lower complexity.

1 Introduction

Over the past several decades, a large number of approaches have been proposed to solve the problem of speech enhancement. Traditional methods, such as spectral subtraction [1], Wiener filtering [2, 3], minimum mean square error (MMSE) [4], statistical models [5, 6], and wavelet transform [7, 8], make statistical assumptions about the background noise and do not properly handle non-stationary noises, which are very common in daily life.

With the appearance of computational auditory scene analysis (CASA), methods based on auditory scene analysis were applied to speech enhancement [9]. For example, Zhang et al. proposed a speech enhancement algorithm based on CASA [10], which extracts features and estimates the spectrum in the gammatone domain and filters out the noise with an IRM. This approach makes no assumption about the noise, which makes it suitable for handling non-stationary noises and gives it better generalization in complex noise environments. However, it has difficulty dealing with unvoiced speech, which results in poor perceptual quality.

With the development of deep learning, the DNN has become one of the most popular tools for speech enhancement. DNN-based speech enhancement learns the complex nonlinear relationship between noisy and clean speech [11]. Deep structures are good at learning the nonlinear relationship between noise and speech and perform better in non-stationary background noise. According to the training target, deep learning-based speech enhancement algorithms can be divided into mapping and masking approaches [11]. Researchers have proposed many mapping-based speech enhancement algorithms [12]. For example, in 2014, Weninger et al. proposed single-channel speech separation with memory-enhanced recurrent neural networks [13]. In this algorithm, a long short-term memory recurrent neural network (LSTM-RNN) is employed as a nonlinear regression function to predict clean speech as well as noise features from noisy speech features, and a magnitude-domain soft mask is then constructed from these features. In 2015, Xu et al. extended the DNN-based speech enhancement framework to handle adverse conditions and non-stationary noise types in real situations [14]. At the same time, Huang et al. put forward joint optimization of masks and deep recurrent neural networks (DRNN) for monaural source separation [15]. In 2016, Vu et al. presented a speech enhancement algorithm combining non-negative matrix factorization and deep neural networks [16]. The algorithms mentioned above all estimate the amplitude spectrum of the target speech. However, it is very difficult to estimate the amplitude spectrum of the target speech accurately, and these algorithms have such high complexity and long time delay that they are computationally expensive and unsuitable for real-time systems. Besides, Li et al. proposed an improved least mean square adaptive filtering (ILMSAF)-based speech enhancement algorithm with DNN and noise classification [17], which introduces an adaptive coefficient for the filter's parameters based on ILMSAF. This algorithm performs well but is too complex to be used in practice.

In addition, many researchers have regarded time-frequency masking as the target of deep learning for speech enhancement and proposed corresponding algorithms. For example, Wang et al. presented a speech enhancement system based on a deep neural network-support vector machine (DNN-SVM) [18], in which the IBM is the target of the DNN-SVM model. Narayanan et al. proposed an IRM estimator using deep neural networks for robust speech recognition [19], in which the estimated IRM in the Mel-frequency domain is used to filter out noise from the noisy Mel spectrogram. In 2014, Wang et al. used a fixed set of complementary features including the amplitude modulation spectrogram, relative spectral transformed perceptual linear prediction coefficients, Mel-frequency cepstral coefficients (MFCC), and 64-channel gammatone features [20]. In addition, Chen et al. proposed a new feature called the multi-resolution cochleagram (MRCG) [21]. However, the MRCG dimension is so large that the algorithm complexity is very high. In 2015, Tseng et al. took a classification-based approach, where the goal is to estimate an IBM and sparse non-negative matrix factorization (SNMF) is used to extract features from the noisy speech [22]. In 2016, Jiang et al. developed a DNN parameter mask for binaural reverberant speech segregation [23]. In 2017, Li et al. presented IRM estimation using deep neural networks for monaural speech segregation in noisy reverberant conditions [24], Zhang et al. presented multi-target ensemble learning for monaural speech separation [25], and Sun et al. proposed multiple-target deep learning for LSTM-RNN-based speech enhancement [26], in which an IRM and a log-power spectrum are the training targets of the DNN. However, the performance of the above DNN-based speech enhancement algorithms is not ideal in low SNR environments, and their complexity is very high.

Based on the above analysis, a speech enhancement algorithm based on MRACC and DNN is proposed. Firstly, a new feature called MRACC is presented on the basis of the MRCG feature [21]. Secondly, in order to remove the noise, an adaptive mask is constructed. Thirdly, a DNN model with four hidden layers is adopted to estimate the adaptive mask. Finally, the enhanced speech is synthesized using the estimated adaptive mask and the noisy speech. The experimental results show that the proposed speech enhancement algorithm has stronger robustness, better denoising performance, and lower complexity than the contrast algorithms [20, 21].

This paper is organized as follows. In Section 2, the proposed speech enhancement algorithm based on MRACC and AM with a deep neural network is presented. Simulation experiments are given in Section 3 to illustrate the performance of the proposed algorithm. Finally, we summarize our work in Section 4.

2 Speech enhancement algorithm with deep neural network

2.1 Time-frequency decomposition

The speech signal is a typical time-varying signal. Time-frequency decomposition focuses on the time-varying spectral features of the speech signal components; it decomposes a one-dimensional speech signal into a two-dimensional representation in order to reveal the relationship between the frequency components and time [27]. The gammatone filter is an excellent tool for time-frequency decomposition. It simulates the sharp filtering characteristics of the basilar membrane well, it is in accordance with the auditory perception of the human ear [28], and it is easy to implement. Therefore, in this paper, the gammatone filter is used to decompose the noisy speech into several sub-band signals (see our previous work [29]). The impulse response of the gammatone filter is as follows:

$$ g\left(t,{f}_{\mathrm{c}}\right)={t}^{l-1}{e}^{-2\pi B\left({f}_{\mathrm{c}}\right)t}\cos \left(2\pi {f}_{\mathrm{c}}t+\phi \right),\quad t\ge 0 $$
(1)

where t represents the sample index; fc is the center frequency of the cth channel, which varies from 50 to 8000 Hz; and φ is the initial phase of the gammatone filter. In order to simplify the model, φ is set to 0. l is the filter order. A large number of experiments show that when l = 4, the filter simulates the cochlear filter characteristics well, so l is set to 4 in this paper. The sampling rate of the experimental data is set to 16 kHz. In order to better reflect the harmonic characteristics of the speech signal in each sub-band, the number of filters is set to 64 in the proposed algorithm. B(fc) is the bandwidth of each frequency channel, which is defined as:

$$ B\left({f}_{\mathrm{c}}\right)=b\cdot \mathrm{ERB}\left({f}_{\mathrm{c}}\right) $$
(2)

where b is an attenuation factor; the best filter performance is obtained when b equals 1.019, so b is set to 1.019 in this paper. ERB(fc) denotes the equivalent rectangular bandwidth (ERB), and its relationship with the central frequency fc is described by:

$$ \mathrm{ERB}\left({f}_{\mathrm{c}}\right)=24.7\left(4.37{f}_{\mathrm{c}}/1000+1\right) $$
(3)

where the coefficients 24.7 and 4.37 are the empirical values obtained in the experiment [21].
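To make Eqs. (1)-(3) concrete, the following minimal Python sketch builds the ERB bandwidth and a finite-length gammatone impulse response. The ERB-number spacing of the 64 center frequencies, the 128-ms truncation of the impulse response, and the per-channel peak normalization are assumptions of this illustration, not choices stated in the paper.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth ERB(fc) of Eq. (3), fc in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def erb_space(low=50.0, high=8000.0, n=64):
    """n center frequencies equally spaced on the ERB-number scale
    between 50 Hz and 8 kHz (one common spacing choice, assumed here)."""
    to_erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    from_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return from_erb(np.linspace(to_erb(low), to_erb(high), n))

def gammatone_ir(fc, fs=16000, duration=0.128, l=4, b=1.019):
    """Finite-length gammatone impulse response of Eq. (1) with B(fc) = b*ERB(fc)
    from Eq. (2); the initial phase is 0 and the time axis is taken in seconds."""
    t = np.arange(int(duration * fs)) / fs
    g = t ** (l - 1) * np.exp(-2.0 * np.pi * b * erb(fc) * t) * np.cos(2.0 * np.pi * fc * t)
    return g / (np.max(np.abs(g)) + 1e-12)   # peak normalization per channel (assumption)
```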

The input signal can be expressed as:

$$ x(t)=s(t)+n(t) $$
(4)

where x(t) represents the noisy speech signal, s(t) represents the clean speech signal, and n(t) represents the noise signal.

x(t) is decomposed into 64 sub-band signals G(t, fc) by 64-channel gammatone filters, as shown in formula (5):

$$ G\left(t,{f}_{\mathrm{c}}\right)=g\left(t,{f}_{\mathrm{c}}\right)\cdot U(t)\cdot x(t) $$
(5)

where U(t) is the unit step function.

Then, each sub-band signal is divided into time-frequency (T-F) units using 20-ms frames with a 10-ms frame shift. A T-F unit corresponds to a small auditory unit of the noisy speech. It is defined as:

$$ {y}_i\left(t,{f}_{\mathrm{c}}\right)=w(t)\cdot G\left(\left(\left(i-1\right)\cdot inc+t\right),{f}_{\mathrm{c}}\right) $$
(6)

where w(t) is a window function. Compared with the rectangular window, the Hamming window better reflects the frequency characteristics of the speech signal, so the Hamming window is chosen in this paper. yi(t, fc) is the sub-band T-F unit of the cth channel at time frame i, and inc is the frame shift.

The cochleagram value CG(i, fc), i.e., the power of each T-F unit at the output of the auditory filter, is calculated by:

$$ \mathrm{CG}\left(i,{f}_{\mathrm{c}}\right)=\sum \limits_{t=0}^{L-1}{y_i}^2\left(t,{f}_{\mathrm{c}}\right) $$
(7)
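The decomposition of Eqs. (5)-(7) can be sketched as follows. Approximating the gammatone filtering by convolution with the truncated impulse response, and reusing the helper names erb_space and gammatone_ir from the previous sketch, are assumptions of this illustration.

```python
import numpy as np

def cochleagram(x, fs=16000, frame_len=320, frame_shift=160, cfs=None):
    """Cochleagram CG(i, fc) of Eq. (7): each gammatone sub-band (Eq. (5), here an
    FIR approximation via convolution with the truncated impulse response) is cut
    into Hamming-windowed frames (Eq. (6)) and the energy of each T-F unit is summed."""
    if cfs is None:
        cfs = erb_space(50.0, 8000.0, 64)            # 64 channels, 50 Hz to 8 kHz
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    cg = np.zeros((n_frames, len(cfs)))
    for c, fc in enumerate(cfs):
        sub = np.convolve(x, gammatone_ir(fc, fs), mode='full')[:len(x)]
        for i in range(n_frames):
            frame = win * sub[i * frame_shift: i * frame_shift + frame_len]
            cg[i, c] = np.sum(frame ** 2)
    return cg
```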

2.2 Feature extraction

Good features are crucial to the performance of speech enhancement. In 2014, Chen et al. proposed a feature called MRCG [21], which is extracted from four cochleagrams of different resolutions to capture both local information and spectrotemporal context. As is well known, human auditory nonlinearity expands small sounds and compresses large sounds. The MRCG simulates this nonlinearity with a log function. The log function compresses high power intensities well but overexpands very small values toward negative infinity. Since noise often has very small energy, the log function tends to emphasize noise and results in poor noise robustness. In order to avoid overexpanding very small noise, we apply a power function to the four cochleagrams, which better simulates human auditory nonlinearity. In addition, the dimension of the MRCG is so large that the computational complexity is very high. In order to reduce the computational complexity, we apply a discrete cosine transform (DCT) after the power compression. The modified MRCG is called MRACC.

2.3 Extraction of MRACC feature

Firstly, the noisy speech x(t) is decomposed by the gammatone filter bank into 64 sub-band signals, and the first 64-channel cochleagram (CG) is calculated with a frame length of 20 ms and a frame shift of 10 ms. A power function is then applied to the CG of each T-F unit. The mathematical expression of CG1 is:

$$ CG1\left(i,{f}_{\mathrm{c}}\right)=g\left[\mathrm{CG}\left(i,{f}_{\mathrm{c}}\right)\right] $$
(8)

where g() is a power function, g(x) = x^a, and in this paper, a = 1/15 as Kim suggested [30].

Similarly, the second 64-channel cochleagram (CG2) is computed with the frame length of 200 ms and frame shift of 10 ms.

The third 64-channel cochleagram (CG3) is derived by averaging CG1 across a square window of 11 frequency channels and 11 time frames centered at a given T-F unit. It can be expressed as:

$$ \mathrm{CG}3\left(i,{f}_{\mathrm{c}}\right)=\frac{1}{11\times 11}\sum \limits_{k=c-5}^{c+5}\sum \limits_{j=i-5}^{i+5}\mathrm{CG}1\left(j,{f}_{k}\right) $$
(9)

The fourth 64-channel cochleagram (CG4) is calculated in a similar way to CG3, except that a square window of 23 frequency channels and 23 time frames is used. It can be expressed as:

$$ \mathrm{CG}4\left(i,{f}_{\mathrm{c}}\right)=\frac{1}{23\times 23}\sum \limits_{d=c-11}^{c+11}\sum \limits_{j=i-11}^{i+11}\mathrm{CG}1\left(j,{f}_{d}\right) $$
(10)

The CG1, CG2, CG3, and CG4 are concatenated to obtain an improved MRCG (IMRCG) feature, which has 64 × 4 dimensions for each time frame. The IMRCG feature is denoted as:

$$ \mathrm{IMRCG}\left(i,{f}_{\mathrm{c}}\right)=\left[\mathrm{CG}1\left(i,{f}_{\mathrm{c}}\right);\mathrm{CG}2\left(i,{f}_{\mathrm{c}}\right);\mathrm{CG}3\left(i,{f}_{\mathrm{c}}\right);\mathrm{CG}4\left(i,{f}_{\mathrm{c}}\right)\right] $$
(11)
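A possible implementation of Eqs. (8)-(11) is sketched below. The frame-count alignment between CG1 and CG2 and the use of scipy's uniform_filter for the local averaging are simplifications introduced here, not details given in the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def imrcg(x, fs=16000, a=1.0 / 15.0):
    """IMRCG of Eq. (11): four power-compressed cochleagrams of 64 channels each,
    concatenated into a 256-dimensional vector per frame."""
    cg1 = cochleagram(x, fs, frame_len=320, frame_shift=160) ** a    # Eq. (8), 20-ms frames
    cg2 = cochleagram(x, fs, frame_len=3200, frame_shift=160) ** a   # 200-ms frames, 10-ms shift
    n = min(cg1.shape[0], cg2.shape[0])                              # align frame counts (simplification)
    cg1, cg2 = cg1[:n], cg2[:n]
    cg3 = uniform_filter(cg1, size=11, mode='nearest')               # 11 x 11 local average, Eq. (9)
    cg4 = uniform_filter(cg1, size=23, mode='nearest')               # 23 x 23 local average, Eq. (10)
    return np.concatenate([cg1, cg2, cg3, cg4], axis=1)              # frames x 256
```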

The visualization of the MRCG feature and the proposed IMRCG feature is given in Figs. 1 and 2, respectively. In each figure, the left panel plots features extracted from a white noise mixture at − 5 dB SNR, and the right panel those from the corresponding clean speech.

Fig. 1 MRCG feature (log compression)

Fig. 2 IMRCG feature (power compression)

As shown in Figs. 1 and 2, both the MRCG feature and the IMRCG feature partially retain the spectrotemporal information of speech in a noisy environment. However, the IMRCG feature shows a clearer banded structure of speech than the MRCG feature. Therefore, the IMRCG is better able to characterize the difference between speech and noise.

In order to reduce the complexity of the algorithm, we reduce the dimension of the extracted features with a discrete cosine transform (DCT), because the DCT concentrates the energy in the low-order coefficients. The MRACC is therefore obtained by applying a DCT to the IMRCG, defined as follows:

$$ \mathrm{MRACC}\left(i,m\right)={\left(\frac{2}{M}\right)}^{0.5}\sum \limits_{c=1}^M\mathrm{IMRCG}\left(i,{f}_{\mathrm{c}}\right)\cos \left(\frac{\pi m\left(2c-1\right)}{2M}\right) $$
(12)

where MRACC(i, m) denotes the mth multi-resolution auditory cepstral coefficient of the ith frame, M is the number of channels (M = 64), and m is the feature dimension index. When m > 36, the values of MRACC(i, m) are relatively small, so only the first 36 coefficients of MRACC(i, m) are retained.
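The following sketch illustrates Eq. (12). Applying the orthonormal DCT separately to each of the four 64-channel cochleagrams is our reading of the equation, chosen so that the 36 retained coefficients per resolution yield 144 static dimensions per frame (432 once the deltas of Section 2.4 are appended).

```python
import numpy as np
from scipy.fft import dct

def mracc(imrcg_feat, n_keep=36):
    """MRACC of Eq. (12): an orthonormal DCT-II over the 64 channels of each of the
    four cochleagrams, keeping the first 36 coefficients per resolution."""
    frames, dim = imrcg_feat.shape
    blocks = imrcg_feat.reshape(frames, 4, dim // 4)        # split back into CG1..CG4
    coeffs = dct(blocks, type=2, axis=-1, norm='ortho')     # DCT along the channel axis
    return coeffs[:, :, :n_keep].reshape(frames, 4 * n_keep)  # frames x 144
```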

2.4 Extraction of dynamic feature

In order to improve the accuracy of the target estimate, dynamic features are extracted from the MRACC, because delta features carry temporal context. The combination of the static and dynamic features can therefore improve the accuracy of the target estimation. This approach avoids relying on a recurrent neural network to capture temporal dynamics and reduces the algorithm complexity.

The dynamic features (∆MRACC and ∆∆MRACC) are obtained from formulas (13) and (14):

$$ \Delta \mathrm{MRACC}\left(i,m\right)=\frac{\sum_{k=1}^{K}k\left(\mathrm{MRACC}\left(i+k,m\right)-\mathrm{MRACC}\left(i-k,m\right)\right)}{\sqrt{2{\sum}_{k=1}^{K}{k}^2}} $$
(13)
$$ \Delta \Delta \mathrm{MRACC}\left(i,m\right)=\frac{\sum_{k=1}^{K}k\left(\Delta \mathrm{MRACC}\left(i+k,m\right)-\Delta \mathrm{MRACC}\left(i-k,m\right)\right)}{\sqrt{2{\sum}_{k=1}^{K}{k}^2}} $$
(14)

where K is set to 2, which means that the two frames before and the two frames after the current frame are used. The proposed feature v can thus be defined as:

$$ v\left(i,m\right)=\left[\mathrm{MRACC}\left(i,m\right);\varDelta \mathrm{MRACC}\left(i,m\right);\varDelta \varDelta \mathrm{MRACC}\left(i,m\right)\right] $$
(15)
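Eqs. (13)-(15) can be realized as below; replicating the edge frames for the first and last K frames is an assumption made so that the delta features keep the same number of frames as the static ones.

```python
import numpy as np

def deltas(feat, K=2):
    """Delta features of Eqs. (13)-(14) over +/-K frames; edge frames are
    replicated so the output keeps the same number of frames (assumption)."""
    pad = np.pad(feat, ((K, K), (0, 0)), mode='edge')
    num = sum(k * (pad[K + k:K + k + len(feat)] - pad[K - k:K - k + len(feat)])
              for k in range(1, K + 1))
    return num / np.sqrt(2.0 * sum(k * k for k in range(1, K + 1)))

def mracc_with_deltas(static):
    """Final 432-dimensional DNN input v(i, m) of Eq. (15)."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.concatenate([static, d1, d2], axis=1)
```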

Figure 3 shows the waveforms and spectrograms of a test utterance enhanced with the proposed MRACC feature and with the MRCG feature.

Fig. 3 Results sample of speech enhancement based on MRCG and MRACC features. a Waveform of clean speech. b Waveform of noisy speech. c Waveform of enhanced speech with MRCG. d Waveform of enhanced speech with MRACC. e Spectrogram of clean speech. f Spectrogram of noisy speech. g Spectrogram of enhanced speech with MRCG. h Spectrogram of enhanced speech with MRACC

It can be seen from Fig. 3 that the residual noise in the enhanced speech based on the MRACC feature is almost the same as that based on the MRCG feature. However, the enhanced speech based on the MRACC feature retains more speech information and is closer to the clean speech. Therefore, the MRACC feature is better than the MRCG feature.

2.5 Deep neural network model

Due to the strong nonlinear mapping ability of the DNN, we propose a DNN-based adaptive mask estimator to calculate the adaptive mask for each T-F unit of the noisy speech. In the training phase, the adaptive mask of each T-F unit of the noisy training speech is calculated (as described below in this section) and used as the training target of the DNN. In the test phase, the adaptive mask is estimated by the trained DNN from the MRACC inputs and is used to synthesize the enhanced speech from the noisy speech. The DNN consists of three parts: the input layer, the hidden layers, and the output layer. The input layer receives the feature vector of the noisy speech, the hidden part is a stack of multiple hidden layers, and the output layer produces the adaptive mask. The structure of the DNN used in this paper is shown in Fig. 4.

Fig. 4 The architecture of the DNN

The DNN model constructed in this paper is composed of one input layer, four hidden layers, and one output layer. The proposed MRACC feature is a 432-dimensional vector, so the input layer has 432 neurons. Experimental results show that the DNN performs best with 1024 units per hidden layer; therefore, each hidden layer has 1024 rectified linear units (ReLU), which improve generalization and avoid the vanishing gradient problem. One frame of the adaptive mask is a 64-dimensional vector, so the output layer has 64 units with a sigmoid activation function. Consequently, the structure of the DNN is 432-1024-1024-1024-1024-64.

The DNN is trained with the standard backpropagation (BP) algorithm coupled with dropout regularization. Dropout mitigates overfitting by randomly discarding a certain percentage of the hidden units, which prevents complex co-adaptation so that hidden units cannot rely on one another; in this paper, the dropout rate is 0.2. No unsupervised pre-training is used, since its benefit weakens for large training sets. The mean squared error (MSE) is used as the loss function and is minimized with an adaptive gradient descent algorithm together with a momentum term. The number of training epochs is 25; for the first five epochs, the momentum rate is set to 0.5, after which it increases to 0.9.
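A minimal Keras sketch of this network and training setup follows. The learning rate, batch size, and the use of plain SGD with a momentum term in place of the paper's adaptive gradient descent are assumptions of this sketch, and the variable names are placeholders.

```python
from tensorflow import keras

def build_am_estimator():
    """432-1024-1024-1024-1024-64 DNN: ReLU hidden layers with dropout 0.2,
    sigmoid output for the 64-channel adaptive mask, MSE loss."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(432,)))
    for _ in range(4):
        model.add(keras.layers.Dense(1024, activation='relu'))
        model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Dense(64, activation='sigmoid'))
    # SGD with momentum stands in for the paper's adaptive gradient descent with a
    # momentum term; the paper ramps momentum from 0.5 to 0.9 after the first 5 of 25 epochs.
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
                  loss='mse')
    return model

# model = build_am_estimator()
# model.fit(train_features, train_masks, epochs=25, batch_size=256)  # placeholder arrays
```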

The IBM is the main computational goal of computational auditory scene analysis. It has been proved to greatly improve speech intelligibility [11], but it seriously damages speech quality. Compared with the IBM, the IRM yields better speech quality but worse speech intelligibility [18]. Therefore, in order to balance speech quality and intelligibility, we propose an AM as the training target, which is obtained adaptively from the IBM and IRM according to the noise change. We calculate the energy of each time-frequency unit of speech and noise and obtain the IBM, the IRM, and the signal-to-noise ratio of the noisy speech according to Eqs. (17)-(20) and (22)-(25). The adaptive masking coefficient α is derived from the signal-to-noise ratio and is used to weight the IBM and IRM to obtain the adaptive mask, which serves as the training target of the DNN through Eq. (16).

The formula of adaptive mask proposed in this paper is as follows:

$$ \mathrm{AM}\left(i,{f}_{\mathrm{c}}\right)=\left(1-\alpha \left(i,{f}_{\mathrm{c}}\right)\right)\ast \mathrm{IBM}\left(i,{f}_{\mathrm{c}}\right)+\alpha \left(i,{f}_{\mathrm{c}}\right)\ast \mathrm{IRM}\left(i,{f}_{\mathrm{c}}\right) $$
(16)

where IBM(i, fc) denotes the ideal binary mask (IBM) [18]; it can be defined as follows:

$$ \mathrm{IBM}\left(i,{f}_{\mathrm{c}}\right)=\left\{\begin{array}{cc}1& {E}_{\mathrm{s}}\left(i,{f}_{\mathrm{c}}\right)\ge {E}_{\mathrm{n}}\left(i,{f}_{\mathrm{c}}\right)\cdot {10}^{\frac{lc}{10}}\\ {}0& \mathrm{else}\end{array}\right. $$
(17)

Es(i, fc) and En(i, fc) represent the energy of clean speech and noise, respectively. They are calculated by formulas (18) and (19). lc is a threshold and is usually set to 1.

$$ {E}_{\mathrm{s}}\left(i,{f}_{\mathrm{c}}\right)=\sum \limits_{t=0}^{L-1}{s_i}^2\left(t,{f}_{\mathrm{c}}\right) $$
(18)
$$ {E}_{\mathrm{n}}\left(i,{f}_{\mathrm{c}}\right)=\sum \limits_{t=0}^{L-1}{n_i}^2\left(t,{f}_{\mathrm{c}}\right) $$
(19)

IRM(i, fc) is an ideal ratio mask (IRM) [23], which is defined as:

$$ \mathrm{IRM}\left(i,{f}_{\mathrm{c}}\right)={\left(\frac{E_{\mathrm{s}}\left(i,{f}_{\mathrm{c}}\right)}{E_{\mathrm{s}}\left(i,{f}_{\mathrm{c}}\right)+{E}_{\mathrm{n}}\left(i,{f}_{\mathrm{c}}\right)}\right)}^{\beta } $$
(20)

β is an adjustable scale factor; a large number of experiments show that the IRM performs best when β = 0.5, so β is set to 0.5 in this paper.

The adaptive coefficient α(i, fc) is defined as [7]:

$$ \alpha \left(i,{f}_{\mathrm{c}}\right)=\frac{1}{1+\exp \left(-\mathrm{SNR}\left(i,{f}_{\mathrm{c}}\right)\right)} $$
(21)

Here, SNR(i, fc) is the signal-to-noise ratio of each T-F unit, calculated as:

$$ \mathrm{SNR}\left(i,{f}_{\mathrm{c}}\right)=\frac{y^2\left(i,{f}_{\mathrm{c}}\right)}{n^2\left(i,{f}_{\mathrm{c}}\right)} $$
(22)

y²(i, fc) and n²(i, fc) denote the noisy speech energy and the noise energy of the ith frame in the cth sub-band, respectively.

Assuming that the first six frames contain only noise, the average energy of the five frames following the first one is computed by Eq. (23) and used as the initial noise energy at the sixth frame; the noisy-speech energy of each frame is computed by Eq. (24), and the noise energy is then updated recursively by Eq. (25).

$$ {\overline{n}}^2\left(i,{f}_{\mathrm{c}}\right)=\frac{1}{5}\sum \limits_{a=0}^4{n}^2\left(i-a,{f}_{\mathrm{c}}\right) $$
(23)
$$ {y}^2\left(i,{f}_{\mathrm{c}}\right)=\frac{1}{N}\sum \limits_{t=0}^{N-1}{\left({y}_i\left(t,{f}_{\mathrm{c}}\right)\right)}^2 $$
(24)
$$ {n}^2\left(i,{f}_{\mathrm{c}}\right)=\alpha \left(i,{f}_{\mathrm{c}}\right)\times {n}^2\left(i-1,{f}_{\mathrm{c}}\right)+\left(1-\alpha \left(i,{f}_{\mathrm{c}}\right)\right)\times {y}^2\left(i,{f}_{\mathrm{c}}\right) $$
(25)

where n̄²(i, fc) is the initial noise energy, N is the number of sampling points in one frame (set to 320), and a is the frame index.
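The construction of the training target in Eqs. (16)-(22) can be summarized as follows. The numerical floor and the substitution of E_s + E_n for the noisy energy when it is not supplied are assumptions of this sketch, and the recursive noise tracking of Eqs. (23)-(25) is omitted.

```python
import numpy as np

def adaptive_mask(E_s, E_n, E_y=None, lc=1.0, beta=0.5):
    """Training-target adaptive mask of Eq. (16) from per-unit speech and noise
    energies (arrays of shape frames x channels). E_y is the noisy-speech energy;
    if it is not supplied, E_s + E_n is used as a stand-in (assumption)."""
    eps = 1e-12                                              # numerical floor (assumption)
    if E_y is None:
        E_y = E_s + E_n
    ibm = (E_s >= E_n * 10.0 ** (lc / 10.0)).astype(float)   # Eq. (17)
    irm = (E_s / (E_s + E_n + eps)) ** beta                  # Eq. (20)
    snr = E_y / (E_n + eps)                                  # Eq. (22), energy ratio
    alpha = 1.0 / (1.0 + np.exp(-snr))                       # Eq. (21)
    return (1.0 - alpha) * ibm + alpha * irm                 # Eq. (16)
```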

Figure 5 shows the waveforms and spectrograms of a test utterance enhanced with the IBM, the IRM, and the proposed adaptive mask. In Fig. 5, compared with the enhanced speech with the IRM, the enhanced speech with the IBM contains less noise, but its quality is poor. The enhanced speech with the IRM keeps more speech information, but its intelligibility is worse. Based on these complementary advantages and disadvantages of the IBM and IRM, we propose an adaptive mask that combines them. The enhanced speech with the adaptive mask not only has less residual noise but also retains the speech information well. Therefore, the proposed adaptive mask outperforms both the IBM and the IRM.

Fig. 5 Speech enhancement effect samples with − 5 dB white noise. a Waveform of clean speech. b Waveform of noisy speech. c Waveform of enhanced speech with IBM. d Waveform of enhanced speech with IRM. e Waveform of enhanced speech with the proposed adaptive mask. f Spectrogram of clean speech. g Spectrogram of noisy speech. h Spectrogram of enhanced speech with IBM. i Spectrogram of enhanced speech with IRM. j Spectrogram of enhanced speech with the proposed adaptive mask

2.6 Algorithm implementation steps

The block diagram of the implementation steps of the proposed algorithm is shown in Fig. 6, which depicts the processing pipeline of the proposed speech enhancement algorithm. In the training phase, we calculate the energy of each T-F unit of speech and noise and obtain the IBM, the IRM, and the signal-to-noise ratio of the noisy speech. The adaptive masking coefficient α is derived from the signal-to-noise ratio and used to weight the IBM and IRM to obtain the adaptive mask (AM) as the training target of the DNN. Then, the MRACC features of the noisy speech are extracted as the inputs for deep learning. We train the DNN model and save its weights and thresholds after training is completed; the DNN architecture is 432-1024-1024-1024-1024-64. In the test phase, the MRACC feature vector of the test sample is fed into the trained DNN model to obtain the estimated adaptive mask, and the enhanced speech is then synthesized from the test sample and the estimated adaptive mask, as sketched below.
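As sketched below, the test phase chains the earlier feature sketches with the trained DNN and weights the sub-band T-F units by the estimated mask. The overlap-add resynthesis shown here is a simplification that ignores gammatone group-delay compensation and gain normalization, and it reuses the hypothetical helpers defined in the previous sketches.

```python
import numpy as np

def enhance(noisy, model, fs=16000, frame_len=320, frame_shift=160):
    """Test-phase sketch: extract the 432-dimensional MRACC input, let the trained
    DNN predict the 64-channel adaptive mask per frame, weight the Hamming-windowed
    gammatone sub-band frames, then overlap-add and sum across channels."""
    cfs = erb_space(50.0, 8000.0, 64)
    feats = mracc_with_deltas(mracc(imrcg(noisy, fs)))   # frames x 432
    am = model.predict(feats)                            # estimated AM, frames x 64
    win = np.hamming(frame_len)
    out = np.zeros(len(noisy))
    for c, fc in enumerate(cfs):
        sub = np.convolve(noisy, gammatone_ir(fc, fs), mode='full')[:len(noisy)]
        for i in range(am.shape[0]):
            start = i * frame_shift
            if start + frame_len > len(noisy):
                break
            out[start:start + frame_len] += am[i, c] * win * sub[start:start + frame_len]
    return out
```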

Fig. 6 The principle block diagram of the speech enhancement algorithm

3 Results and discussions

3.1 Experimental data

In the experiment, the clean utterances come from the NTT corpus, with a sampling rate of 16 kHz. Three languages are selected from the NTT corpus: English, Chinese, and French. Each language contains 96 sentences produced by 8 speakers (4 male and 4 female, 12 utterances per speaker), and each sentence is 8 s long. Therefore, there are 288 (96 × 3) clean utterances. For each language, 76 clean sentences are randomly selected as training data, and the remaining 20 sentences are used as test data. Seventeen noise types, namely buccaneer1, buccaneer2, babble, destroyerengine, destroyerops, f16, factory1, factory2, hfchannel, leopard, m109, machinegun, pink, volvo, white, office, and street, are selected from the NoiseX-92 database. The training set covers the first 15 noise types listed above. To evaluate the performance of the proposed algorithm in unknown noise environments, office and street noise are used as noise types not included in the training set. The 288 clean sentences are corrupted with the abovementioned 17 noise types at 4 SNR levels, i.e., 10 dB, 5 dB, 0 dB, and − 5 dB, to build a multi-condition data set.

In order to verify the effectiveness of the proposed algorithm, we select the algorithm on training targets for supervised speech separation [20] as the first contrast algorithm and the feature study for classification-based speech separation at very low signal-to-noise ratio [21] as the second contrast algorithm.

3.2 Objective performance evaluation

The purpose of this test is to evaluate the performance of our proposed algorithm in complex noise environments. In this test, segment SNR (SegSNR), perceptual evaluation of speech quality (PESQ), log-spectral distortion (LSD), and short-time objective intelligibility (STOI) are adopted as the objective measures of speech quality [31,32,33].

For the 17 noise types, the test results of SegSNR, PESQ, LSD, and STOI are shown in Tables 1, 2, 3, and 4, respectively.

Table 1 The SegSNR of the proposed algorithm and the contrast algorithm
Table 2 The LSD of the proposed algorithm and the contrast algorithm
Table 3 The PESQ of the proposed algorithm and the contrast algorithm
Table 4 The STOI of the proposed algorithm and the contrast algorithm

It can be seen from Table 1 that for leopard noise at SNRs of 0 dB and − 5 dB, the SegSNR of the proposed algorithm is better than that of contrast algorithm 1 but lower than that of contrast algorithm 2. For babble and m109 noise at an SNR of − 5 dB, the SegSNR of the proposed algorithm is likewise better than that of contrast algorithm 1 but lower than that of contrast algorithm 2. Compared with other noises, leopard and m109 noise have more complex time-frequency characteristics, and babble noise is similar to speech, so it is difficult to distinguish between speech and noise. However, the average SegSNR of the proposed algorithm is higher than that of the contrast algorithms at every SNR. The reason is that the MRACC feature contains more phonetic information, so the DNN can separate the speech signal from complex noise environments. Thus, the proposed algorithm is better than the contrast algorithms in general.

As shown in Table 2, for babble and leopard noise at an SNR of − 5 dB, the LSD of the proposed algorithm is better than that of contrast algorithm 1 but slightly worse than that of contrast algorithm 2. For the other noise types, the distortion is reduced compared with the contrast algorithms, and the average LSD of the proposed algorithm is better at every SNR. Therefore, the distortion of the enhanced speech based on the proposed algorithm is lower than that of the contrast algorithms on the whole.

Table 3 shows that, for all 17 noise types, the PESQ of the proposed algorithm is greater than that of the contrast algorithms. Consequently, in complex noise environments, the quality of the enhanced speech based on our proposed algorithm is better than that of the contrast algorithms.

It can be seen from Table 4 that for babble, leopard, and m109 noise, the STOI of the proposed algorithm is similar to or slightly lower than that of the contrast algorithms. For the other noises, the STOI of the proposed algorithm is slightly better, and its average STOI is higher than that of the contrast algorithms under every SNR condition. Therefore, the STOI of the proposed algorithm is slightly greater than that of the contrast algorithms overall.

3.3 Subjective performance evaluation

In order to further test the performance of the proposed algorithm, the A/B test, the MOS (mean opinion score), waveforms, and spectrograms are adopted as subjective measures of speech quality. The A/B test, which is often used for page and process testing, reflects a user's preference between two versions and is therefore adopted in this paper for the subjective evaluation. Ten listeners (five males and five females) were invited to perform the A/B test and MOS scoring on the enhanced speech of the proposed algorithm and the contrast algorithms.

For the 17 noise types and SNR conditions of − 5 dB, 0 dB, 5 dB, and 10 dB, the results of the A/B test are summarized in Tables 5, 6, and 7. The MOS ranges from 0 to 5, with higher scores indicating better speech quality. Table 8 presents the MOS results at different SNRs across the 17 noise types.

Table 5 The A/B test of the proposed algorithm and the contrast algorithm with 15 noise types
Table 6 The A/B test of the proposed algorithm and the contrast algorithm with street and office noise
Table 7 The A/B test of the proposed algorithm and the contrast algorithm at different SNRs
Table 8 The MOS of the proposed algorithm and the contrast algorithm

Table 5 shows that, for the 15 trained noise types, the A/B preference for the proposed algorithm is higher than that for the contrast algorithms in every noise condition. In Table 6, for office and street noise, the A/B preference for the proposed algorithm is also better than that for the contrast algorithms. Consequently, the proposed algorithm is more robust in complex noise environments, and its subjective speech quality is better than that of the contrast algorithms.

Table 7 shows that the A/B preference for the proposed algorithm is higher than that for the contrast algorithms at every SNR.

As shown in Table 8, for the 17 noise types, the MOS of the proposed algorithm is higher than that of the compared methods. Therefore, the subjective quality of the enhanced speech based on the proposed algorithm is greater than that based on the contrast algorithms.

Figure 7 shows the waveforms and spectrograms of the proposed algorithm and the contrast algorithms for factory2 noise at an SNR of − 5 dB. It can be seen from Fig. 7 that the proposed algorithm eliminates most of the noise, whereas considerable noise remains in the output of the contrast algorithms, which is annoying to listeners. Consequently, the denoising effect of the proposed algorithm is better, and its overall enhancement effect is greater than that of the contrast algorithms.

Fig. 7 Speech enhancement effect samples with − 5 dB factory2 noise. a Waveform of clean speech. b Waveform of noisy speech. c Waveform of enhanced speech with the contrast algorithm 1. d Waveform of enhanced speech with the contrast algorithm 2. e Waveform of enhanced speech with the proposed algorithm. f Spectrogram of clean speech. g Spectrogram of noisy speech. h Spectrogram of enhanced speech with the contrast algorithm 1. i Spectrogram of enhanced speech with the contrast algorithm 2. j Spectrogram of enhanced speech with the proposed algorithm

According to the analysis of the above test results, we conclude that the proposed algorithm outperforms the compared methods in terms of SegSNR, LSD, PESQ, STOI, the A/B test, and MOS. Moreover, the proposed algorithm performs particularly well in low SNR environments and is therefore well suited to them.

3.4 Algorithm complexity test

In order to test the complexity of the algorithm, the MATLAB running time of each algorithm is reported in Table 9. Each algorithm processes all the test utterances, and the average time taken to process one utterance is calculated. It can be seen that the running time of the proposed algorithm is lower than that of the contrast algorithms, for two reasons. Firstly, the extraction of the MRACC feature in the proposed algorithm has lower complexity than the feature extraction of contrast algorithm 1. Secondly, compared with the MRCG feature in contrast algorithm 2, the dimension of the proposed MRACC feature is reduced. Therefore, over a large number of experiments, the running time of the proposed algorithm is lower than that of the contrast algorithms.

Table 9 The operation time comparison

4 Conclusion

In this paper, a speech enhancement algorithm based on the MRACC and an adaptive mask with deep learning is proposed. Firstly, a new feature, MRACC, is presented. Compared with the MRCG feature, it uses a power function instead of a log function so that it captures local information and spectrotemporal context while remaining robust to low-energy noise, and a DCT is employed to concentrate the energy in the low-order coefficients so that the feature dimension can be reduced according to the energy distribution; the complexity of the proposed algorithm is therefore reduced. Secondly, an adaptive mask which can track noise changes is used for speech enhancement. Because the adaptive mask combines the advantages of the IRM and IBM, it provides a more accurate estimate of the target speech energy ratio with the DNN. Thirdly, a DNN model with four hidden layers is adopted to estimate the adaptive mask; the DNN has strong nonlinear processing ability and can describe the complex nonlinear relationship between noise and speech well. Overall, our proposed algorithm achieves better quality and intelligibility as well as lower complexity than the contrast algorithms.

Abbreviations

AM: Adaptive mask
BP: Backpropagation
CASA: Computational auditory scene analysis
CG: Cochleagram
DCT: Discrete cosine transform
DNN: Deep neural network
DNN-SVM: Deep neural network-support vector machine
DRNN: Deep recurrent neural networks
IBM: Ideal binary mask
ILMSAF: Improved least mean square adaptive filtering
IMRCG: Improved MRCG
IRM: Ideal ratio mask
LSD: Log-spectral distortion
LSTM-RNN: Long short-term memory recurrent neural network
MFCC: Mel-frequency cepstral coefficient
MMSE: Minimum mean square error
MRACC: Multi-resolution auditory cepstral coefficient
MRCG: Multi-resolution cochleagram
MSE: Mean squared error
PESQ: Perceptual evaluation of speech quality
SegSNR: Segment SNR
SNMF: Sparse non-negative matrix factorization
SNR: Signal-to-noise ratio
STOI: Short-time objective intelligibility

References

  1. S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)
  2. J.D. Chen, J. Benesty, Y.T. Huang, S. Doclo, New insights into the noise reduction Wiener filter. IEEE Trans. Audio Speech Lang. Process. 14(4), 1218–1234 (2006)
  3. P.C. Loizou, Speech Enhancement: Theory and Practice (CRC Press, New York, 2007)
  4. R.C. Hendriks, R. Heusdens, J. Jensen, MMSE based noise PSD tracking with low complexity, in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (2010), pp. 4466–4469
  5. A. Ozerov, E. Vincent, F. Bimbot, A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process. 20(4), 1118–1133 (2012)
  6. N. Mohammadiha, P. Smaragdis, A. Leijon, Supervised and unsupervised speech enhancement using non-negative matrix factorization. IEEE Trans. Audio Speech Lang. Process. 21(10), 2140–2151 (2013)
  7. L. Ruwei, B. Changchun, D. Huijing, Speech enhancement using adaptive threshold based on bi-orthogonal wavelet packet decomposition. Chin. J. Sci. Instrum. 29(10), 2135–2140 (2008)
  8. L. Ruwei, B. Changchun, D. Huijing, Speech enhancement algorithm based on wavelet transform. J. Data Acquis. Process. 24(3), 362–368 (2009)
  9. D.L. Wang, G.J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (IEEE Press, Piscataway, 2006)
  10. Z. Weiqiang, G. Cong, Z. Qiao, K. Jian, H. Liang, L. Jia, M.T. Johnson, A speech enhancement algorithm based on computational auditory scene analysis. J. Tianjin Univ. (Sci. Technol.) 48(8), 663–669 (2015)
  11. L. Wen, J. Nie, S. Liang, S. Zhang, X. Liang, Deep learning based speech separation technology and its developments. Acta Automatica Sinica 42(6), 819–833 (2016)
  12. Y. Xu, J. Du, L.R. Dai, C.H. Lee, An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process. Lett. 21(1), 65–68 (2014)
  13. F. Weninger, F. Eyben, B. Schuller, Single-channel speech separation with memory-enhanced recurrent neural networks, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE Press, Florence, 2014), pp. 3737–3741
  14. Y. Xu, J. Du, L.R. Dai, et al., A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 7–19 (2015)
  15. P.S. Huang, M. Kim, M. Hasegawa-Johnson, P. Smaragdis, Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2136–2147 (2015)
  16. T.T. Vu, B. Bigot, E.S. Chng, Combining non-negative matrix factorization and deep neural networks for speech enhancement and automatic speech recognition, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE Press, Shanghai, 2016), pp. 499–503
  17. R. Li, Y. Liu, Y. Shi, W. Cui, ILMSAF based speech enhancement with DNN and noise classification. Speech Commun. 85, 53–70 (2016)
  18. Y. Wang, D.L. Wang, Towards scaling up classification-based speech separation. IEEE Trans. Audio Speech Lang. Process. 21(7), 1381–1390 (2013)
  19. A. Narayanan, D.L. Wang, Ideal ratio mask estimation using deep neural networks for robust speech recognition, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)
  20. Y.X. Wang, A. Narayanan, D.L. Wang, On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1849–1858 (2014)
  21. J. Chen, Y. Wang, D.L. Wang, A feature study for classification-based speech separation at very low signal-to-noise ratio, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014)
  22. H.-W. Tseng, M. Hong, Z.-Q. Luo, Combining sparse NMF with neural network: a new classification-based approach for speech enhancement, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015)
  23. Y. Jiang, W. Li, Y. Zu, A DNN parameter mask for the binaural reverberant speech segregation, in The 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI 2016) (2016)
  24. L. Xu, J. Li, Y. Yan, Ideal ratio mask estimation using deep neural networks for monaural speech segregation in noisy reverberant conditions, in Interspeech (2017), pp. 1203–1207
  25. H. Zhang, X. Zhang, G. Gao, Multi-target ensemble learning for monaural speech separation, in Interspeech (2017), pp. 1958–1962
  26. L. Sun, J. Du, L.-R. Dai, C.-H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement, in Hands-free Speech Communications and Microphone Arrays (HSCMA) (2017)
  27. G. Zhexue, C. Zhongsheng, Matlab Time Frequency Analysis Technology and its Application (People's Posts and Telecommunications Press, Beijing, 2006)
  28. Y.W. Yang, Y. Jiang, R.S. Liu, et al., A realtime analysis/synthesis Gammatone filterbank, in Proc. Signal Processing, Communications and Computing (ICSPCC) (2015), pp. 1–6
  29. R. Li, D. Pan, S. Zhang, Speech enhancement algorithm based on sound source localization and scene matching for binaural digital hearing aids. J. Med. Biol. Eng. (2018). https://doi.org/10.1007/s40846-018-0412-z
  30. C. Kim, Signal Processing for Robust Speech Recognition Motivated by Auditory Processing, Ph.D. dissertation (Carnegie Mellon University, Pittsburgh, 2010)
  31. T. Xiaoheng, Q. Jiwei, Z. Shuai, Objective evaluation method of speech quality based on auditory perceptual properties. J. Southwest Jiaotong Univ. 48(4), 756–760 (2013)
  32. C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19, 2125–2136 (2011)
  33. ITU-T Recommendation P.862, Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs (International Telecommunication Union, Geneva, 2001)


Funding

This work was supported by the Scientific Research Program of Beijing Municipal Commission of Education (No. KM201510005007) and the National Natural Science Foundation of China (No. 51477028).

Availability of data and materials

Please contact authors for data requests.

Author information

Contributions

RL devised the algorithm, checked the experiment, and improved this paper. XS wrote the draft of this paper and did partial simulation experiments. YL programmed the code and did the simulation experiments. DY helped to check the codes. LD improved the English of this paper. All the authors wrote this paper together, and they have read and approved the final manuscript.

Corresponding author

Correspondence to Ruwei Li.

Ethics declarations

Ethics approval and consent to participate

This study does not involve human participants, human data, or human tissue.

Consent for publication

The manuscript does not contain any individual person's data.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Li, R., Sun, X., Liu, Y. et al. Multi-resolution auditory cepstral coefficient and adaptive mask for speech enhancement with deep neural network. EURASIP J. Adv. Signal Process. 2019, 22 (2019). https://doi.org/10.1186/s13634-019-0618-4


Keywords