3.1 Overall learning framework
Figure 4 shows the overall procedure of the DNN-GRU model, which consists of a training phase and an enhancement phase. Before training, LPS features are extracted from both the noisy speech and the clean speech. In the training phase, a two-stage speech enhancement neural network with nonlinearities is adopted to learn the mapping from noisy speech features to clean speech features. First, the LPS features of the noisy speech and the clean speech are fed into a fully connected feed-forward DNN to obtain the optimal weights, biases and hyper-parameters. Then, the LPS features pre-processed by the DNN are combined with those of the noisy speech to compensate for the missing time-series information. Lastly, the combined LPS features and the LPS features of the clean speech are used to build the mapping function of the GRU network, which performs the noise reduction. In the enhancement phase, the noisy speech is fed into the well-trained DNN-GRU model to predict the LPS features of the clean speech. The estimated LPS features are then used for waveform recovery to obtain the enhanced speech. The speech enhanced by the DNN-GRU model is coherent: the model preserves the contextual information of the speech signal, which improves speech intelligibility and quality.
In Fig. 4, \(Y\left( m \right)\) is the noisy speech, \(Y^{{{\text{LPS}}}}\) denotes the LPS features of the noisy speech, \(X^{{{\text{LPS}}}}\) denotes the LPS features of the clean speech, \(X^{R}\) is the estimated speech, and \(\angle {\text{Y}}^{{\text{R}}}\) is the phase of the noisy speech.
3.2 DNN-GRU model-based training
Noisy speech is constructed by adding noise to clean speech, as in Eq. (7). The clean speech and noise form speech-pair datasets, which are divided into training sets and test sets.
$$Y\left( m \right) = X\left( m \right) + N\left( m \right)$$
(7)
where \(Y\left( m \right)\), \(X\left( m \right)\) and \(N\left( m \right)\) represent noisy speech, clean speech and noise at time \(m\), respectively.
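The additive model of Eq. (7) can be sketched in a few lines of NumPy. The SNR-based scaling of the noise is an assumption added for illustration, since this section does not state how noise levels are set, and `mix_at_snr` is a hypothetical helper name.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return Y(m) = X(m) + g * N(m), with the gain g chosen to reach snr_db.

    The gain is an illustrative assumption; Eq. (7) itself is the plain sum.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain g such that 10*log10(clean_power / (g^2 * noise_power)) == snr_db.
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for clean speech
n = rng.standard_normal(4000)                            # stand-in for noise
y = mix_at_snr(x, n, snr_db=5.0)
```

Setting the gain to 1 recovers the unscaled sum of Eq. (7).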
In the LPS domain, the target values of different frequency bins are predicted independently without any correlation constraint, and can be transformed back to the waveform domain without any information loss. The extraction process of LPS features is as follows.
First, during pre-processing, the speech signal is divided into 25 ms frames with a 10 ms frame shift, and each frame is weighted by a Hamming window, as shown in Eq. (8).
$$Y_{t} \left( n \right) = \mathop \sum \limits_{p = n - L + 1}^{n} y\left( p \right)w\left( {n - p} \right)$$
(8)
where \(Y_{t} \left( n \right)\) is the \(t\)-th frame of the speech signal, \(n\) is the sample index, \(L\) is the frame length, \(p\) is the summation index over the samples of the frame, and \(w\left( \cdot \right)\) is the Hamming window. A discrete Fourier transform (DFT) is performed on \(Y_{t} \left( n \right)\) to obtain the spectrum of each frame as shown in Eq. (9):
$$Y\left( {t,f} \right) = \mathop \sum \limits_{n = 0}^{N - 1} Y_{t} \left( n \right)e^{{ - j\frac{2\pi }{N}fn}} \left( {f = 0,1,2 \cdots N - 1} \right)$$
(9)
where \(f\) is the frequency bin index of frame \(t\), and \(N\) is the number of DFT points. The LPS features are then obtained by taking the logarithm of the power spectrum, which compresses its dynamic range:
$$Y^{{{\text{LPS}}}} \left( {t,f} \right) = \log \left| {Y\left( {t,f} \right)} \right|^{2}$$
(10)
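The extraction steps of Eqs. (8)-(10) can be sketched as follows. The 16 kHz sampling rate, the 512-point DFT and the small floor `eps` are assumptions made for this illustration, and `extract_lps` is a hypothetical helper name. Because the input is real-valued, the sketch keeps only the non-negative frequency bins (`rfft`) rather than all \(N\) bins of Eq. (9).

```python
import numpy as np

def extract_lps(signal, sample_rate=16000, frame_ms=25, shift_ms=10, n_fft=512):
    """Frame, window, transform and log-compress a waveform (Eqs. 8-10).

    Assumes the signal is at least one frame long.
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)   # L in Eq. (8)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    # Row t of idx holds the sample indices that belong to frame t.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * window
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)  # Eq. (9), bins 0..n_fft/2
    eps = 1e-10                                      # assumed floor, avoids log(0)
    lps = np.log(np.abs(spectrum) ** 2 + eps)        # Eq. (10)
    phase = np.angle(spectrum)                       # kept for waveform recovery
    return lps, phase

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
lps, phase = extract_lps(tone)
```

The phase is returned alongside the LPS features because the enhancement stage reuses the noisy phase when rebuilding the waveform.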
3.3 DNN-GRU model
The sequence of noisy LPS features is used as the input of the DNN-GRU model. The DNN-GRU model for speech enhancement contains 8 layers: an input layer, three DNN hidden layers of sizes 1024–1024–1024, one feature fusion layer of size 512, two GRU layers and one output layer. To capture the nonlinear variations of the data, the scaled exponential linear unit (SeLU) is selected as the activation function in the DNN hidden layers. The structure of the DNN-GRU model is shown in Fig. 5.
Firstly, in the first stage, a DNN with three hidden layers learns the mapping between the local LPS features of noisy speech and clean speech, so that the clean LPS features can be estimated from the noisy ones:
$$Y\left( t \right) = \left\{ {y\left( {t - \tau } \right),y\left( {t - \tau + 1} \right), \cdots ,y\left( {t + \tau } \right)} \right\}$$
(11)
$$X^{p} \left( t \right) = x_{t + k} |_{k = - \tau }^{\tau } = f^{{{\text{DNN}}}} (Y_{t} |\theta ),\tau \in \left( {1,X^{R} \left( t \right)} \right)$$
(12)
where \(Y_{t} \in R^{N}\) denotes the noisy LPS vector, \(\left\{ {x_{t + k} } \right\}_{k = - \tau }^{\tau } \in R^{N}\) are the enhanced LPS vectors, \(k\) is the frame offset within the context window, and \(f^{{{\text{DNN}}}} (Y_{t} |\theta )\) denotes the DNN-based function that directly maps the noisy LPS features to clean ones, with DNN parameter set \(\theta\).
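The context window of Eq. (11) can be built by stacking the frames from \(t - \tau\) to \(t + \tau\) for every \(t\). The sketch below is one way to do this in NumPy; repeating the edge frames at the boundaries is an assumed padding choice, and `splice_context` is a hypothetical helper name.

```python
import numpy as np

def splice_context(features, tau):
    """Stack frames t-tau..t+tau for every frame t (Eq. 11).

    features: (T, N) LPS matrix. Returns a (T, (2*tau+1)*N) matrix.
    Boundary frames are repeated, which is an assumed padding choice.
    """
    features = np.asarray(features)
    padded = np.pad(features, ((tau, tau), (0, 0)), mode="edge")
    n_frames = features.shape[0]
    # Slice k holds frame t + k - tau for every t.
    shifted = [padded[k : k + n_frames] for k in range(2 * tau + 1)]
    return np.concatenate(shifted, axis=1)

demo = np.arange(12, dtype=float).reshape(4, 3)  # 4 frames, 3 frequency bins
spliced = splice_context(demo, tau=1)
```

With `tau=1`, each output row holds the previous, current and next frame side by side.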
The DNN is trained with the standard back-propagation (BP) algorithm combined with dropout regularization. Dropout randomly discards neurons with a given probability during training, which prevents complex correlations among hidden neurons and thus reduces over-fitting. Mini-batch stochastic gradient descent is a simple but effective method that is also widely used to mitigate over-fitting in large-scale deep networks. The dropout rate is set to 0.25 in this paper. In the training stage, a linear activation function is used for the output layer. The standard BP algorithm is run for 100 iterations. The mean squared error (MSE) between the predicted and clean speech features is used as the loss function:
$${\text{MSE}} = \frac{1}{L}\mathop \sum \limits_{t = 1}^{L} \left( {X^{{{\text{LPS}}}} \left( t \right) - X^{R} \left( t \right)} \right)^{2}$$
(13)
where \(L\) is the total number of samples, \(X^{{{\text{LPS}}}} \left( t \right)\) denotes the \(t\)-th clean LPS feature vector, and \(X^{R} \left( t \right)\) represents the corresponding predicted LPS features.
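The first-stage forward pass and its loss can be illustrated with a small NumPy sketch. It uses three SeLU hidden layers of 1024 units, a linear output layer, a dropout rate of 0.25 and a mean squared error, as named in the text. The input and output sizes, the weight initialization, the placement of dropout after every hidden layer and all function names are assumptions for illustration only.

```python
import numpy as np

# SELU constants from Klambauer et al. (2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    # np.minimum keeps expm1 from overflowing on large positive inputs.
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0)))

def dropout(h, rate, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

def dnn_forward(y, params, rate=0.25, rng=None):
    """Three SeLU hidden layers followed by a linear output layer."""
    h = y
    for w, b in params[:-1]:
        h = selu(h @ w + b)
        if rng is not None:  # dropout is applied only while training
            h = dropout(h, rate, rng)
    w_out, b_out = params[-1]
    return h @ w_out + b_out

def mse(target, predicted):
    """Mean squared error between clean and predicted LPS features."""
    return np.mean((target - predicted) ** 2)

rng = np.random.default_rng(0)
n_in, n_out = 257 * 3, 257  # assumed: 257 bins, 3 spliced input frames
sizes = [n_in, 1024, 1024, 1024, n_out]
params = [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]
batch = rng.standard_normal((8, n_in))    # stand-in for noisy LPS input
target = rng.standard_normal((8, n_out))  # stand-in for clean LPS targets
loss = mse(target, dnn_forward(batch, params, rng=rng))
```

Passing `rng=None` disables dropout, which corresponds to using the trained network at enhancement time.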
The Adam optimizer is used to update the weights and biases of the hidden neurons in mini-batches. The remaining hyper-parameters, including the learning rate, the number of layers and the number of hidden neurons, depend on the experimental conditions. As described above, if the training data are diverse and large enough, the DNN-GRU model has the potential to learn the nonlinear relationship between noisy speech and clean speech without any prior knowledge.
Secondly, a feature fusion layer is adopted to capture the effective contextual information in the features. As shown in Fig. 6, DNN-GRU has a cascade architecture consisting of a prior NN (DNN) for the first stage and a posterior NN (GRU-NN) for the second stage.
In Fig. 6, \(x^{p} \left( {t - 1} \right)\), \(x^{p} \left( t \right)\) and \(x^{p} \left( {t + 1} \right)\) are the LPS features of three frames after the first-stage DNN, and \(y\left( {t - 1} \right)\), \(y\left( t \right)\) and \(y\left( { t + 1} \right)\) are the LPS features of the corresponding noisy frames. \(Y\left( t \right)\) and \(X^{p} \left( t \right)\) are combined and expanded as shown in Fig. 6 to form \(Y^{*} \left( t \right)\), which is then input into the GRU network for the second stage.
Since the noisy speech contains the time-series information, the combined features are built from the LPS features of the noisy speech and the LPS features produced by the DNN. The new feature frames are combined with the noisy speech frames as follows:
$$Y^{*} \left( t \right) = \left( {y_{t + k + i}^{*} |_{k = - \tau }^{\tau } } \right)|_{i = - \tau }^{\tau } = X^{p} \left( t \right) \cup Y\left( t \right)$$
(14)
where \(X^{p} \left( t \right)\) includes all base predictions \(x^{p} \left( t \right) \in R^{N}\), and \(Y^{*} \left( t \right)\), which contains 128 LPS vectors, is input into the GRU network. \(i\) is the frame offset within the context window of the noisy speech.
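One possible reading of the fusion step in Eq. (14) is a per-frame concatenation of the DNN output with the noisy features. The sketch below follows that reading; treating the union as concatenation along the feature axis is an assumption, and `fuse_features` is a hypothetical helper name.

```python
import numpy as np

def fuse_features(x_p, y):
    """Join DNN-enhanced and noisy LPS frames into Y*(t) (Eq. 14).

    x_p, y: (T, D) arrays covering the same frames. Returns (T, 2*D).
    Reading the union as per-frame concatenation is an assumption.
    """
    x_p = np.asarray(x_p)
    y = np.asarray(y)
    if x_p.shape != y.shape:
        raise ValueError("enhanced and noisy features must have the same shape")
    return np.concatenate([x_p, y], axis=1)
```

Keeping the noisy frames next to the DNN output gives the second stage access to the time-series information that the first stage may have smoothed away.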
The new LPS features at time instances \(t_{k} ,t_{k - 1} , \cdots ,t_{k - n}\) (where \(k\) is the current time instance and \(n\) is the number of prior frames) are fed into a GRU network with two GRU layers. The first GRU layer has 1024 cells; it encodes the input and passes its hidden state to the second GRU layer, which has 512 cells. The two GRU layers establish the mapping from the new features to the training target features, which enhances the whole sequence of frames while preserving the contextual information of speech. The GRU network outputs \(x^{R} \left( t \right)\) form the estimate \(X^{R} \left( t \right)\):
$$\begin{aligned} X^{R} \left( t \right) & = g^{{{\text{GRU}}}} ( \cdot |\eta ) \\ & = \left\{ {x^{R} \left( {t - \tau } \right),x^{R} \left( {t - \tau + 1} \right), \cdots ,x^{R} \left( {t + \tau } \right)} \right\} \\ \end{aligned}$$
(15)
where \(g^{{{\text{GRU}}}} ( \cdot |\eta )\) denotes the GRU network-based function that directly maps the new features \(Y^{*} \left( t \right)\) to clean ones, with GRU network parameter set \(\eta\).
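The second stage can be illustrated with a plain NumPy forward pass through two stacked GRU layers (1024 and 512 cells) and a linear output layer. This is a sketch under stated assumptions rather than the implementation used in this work: the feature sizes, the random initialization and the function names are hypothetical, and the state update \(h_{t} = (1 - z_{t})h_{t - 1} + z_{t}\tilde{h}_{t}\) is one of the two common GRU conventions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(n_in, n_hidden, rng):
    """Random weights for the update (0), reset (1) and candidate (2) paths."""
    scale = 1.0 / np.sqrt(n_hidden)
    return {
        "W": rng.uniform(-scale, scale, (3, n_in, n_hidden)),      # input weights
        "U": rng.uniform(-scale, scale, (3, n_hidden, n_hidden)),  # recurrent weights
        "b": np.zeros((3, n_hidden)),
    }

def gru_layer(inputs, p):
    """Run one GRU layer over a (T, n_in) sequence and return (T, n_hidden)."""
    W, U, b = p["W"], p["U"], p["b"]
    h = np.zeros(U.shape[-1])
    outputs = []
    for x in inputs:
        z = sigmoid(x @ W[0] + h @ U[0] + b[0])           # update gate
        r = sigmoid(x @ W[1] + h @ U[1] + b[1])           # reset gate
        cand = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
        h = (1.0 - z) * h + z * cand                      # blend old and new state
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
n_fused, n_bins = 514, 257             # assumed feature sizes
layer1 = init_gru(n_fused, 1024, rng)  # first GRU layer, 1024 cells
layer2 = init_gru(1024, 512, rng)      # second GRU layer, 512 cells
w_out = rng.standard_normal((512, n_bins)) / np.sqrt(512)  # linear output layer

y_star = rng.standard_normal((7, n_fused))  # 7 fused frames, i.e. tau = 3
hidden1 = gru_layer(y_star, layer1)
x_r = gru_layer(hidden1, layer2) @ w_out    # one estimated LPS vector per frame
```

Because the hidden state is carried from frame to frame, each output frame depends on the frames before it, which is how the network keeps the contextual information.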
3.4 DNN-GRU model-based enhancement
Firstly, the noisy speech is pre-processed in the enhancement stage to obtain a satisfactory enhancement effect. Secondly, the LPS features of the noisy speech are extracted and fed into the well-trained DNN-GRU model as test data. To fully exploit the complementarity of the target set and reduce the impact of network estimation errors on the enhanced speech, the estimated LPS features are used to reconstruct the enhanced waveform.
After testing with the DNN-GRU model, the estimated LPS features of the clean speech are denoted \(X^{{{\text{LPS}}}} \left( {n,k} \right)\). Lastly, the reconstructed spectrum \(X^{R} \left( {n,k} \right)\) is calculated as
$$X^{R} \left( {n,k} \right) = \exp \left\{ {X^{{{\text{LPS}}}} \left( {n,k} \right)/2} \right\}\exp \left\{ {j\angle Y^{R} \left( {n,k} \right)} \right\}$$
(16)
where \(\angle Y^{R} \left( {n,k} \right)\) denotes the phase of the \(k\)-th frequency bin of the \(n\)-th frame of the original noisy speech. After these operations, each frame of enhanced speech is obtained by applying the inverse discrete Fourier transform (IDFT) to the current frame spectrum, and the whole waveform can then be reconstructed.
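The waveform recovery of Eq. (16) can be sketched as below. The frame length of 400 samples and the shift of 160 samples correspond to 25 ms and 10 ms at an assumed 16 kHz sampling rate, and the 512-point DFT is likewise assumed. Overlap-add with normalization by the summed Hamming window is an assumed synthesis method, since the text only states that the frames are inverted by IDFT and the waveform is reconstructed. `reconstruct_waveform` is a hypothetical helper name.

```python
import numpy as np

def reconstruct_waveform(lps, noisy_phase, frame_len=400, shift=160, n_fft=512):
    """Apply Eq. (16) frame by frame and overlap-add the result.

    lps, noisy_phase: (T, n_fft // 2 + 1) arrays.
    """
    magnitude = np.exp(lps / 2.0)                    # undo log |X|^2
    spectrum = magnitude * np.exp(1j * noisy_phase)  # Eq. (16)
    frames = np.fft.irfft(spectrum, n=n_fft, axis=1)[:, :frame_len]
    window = np.hamming(frame_len)
    n_frames = frames.shape[0]
    out = np.zeros(shift * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for t in range(n_frames):
        start = t * shift
        out[start : start + frame_len] += frames[t]
        norm[start : start + frame_len] += window
    # The frames still carry the analysis window; dividing by the summed
    # window removes it wherever frames overlap.
    return out / np.maximum(norm, 1e-8)

# Round trip on a synthetic tone, using the same framing as the analysis step.
sr = 16000
tone = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)
idx = np.arange(400)[None, :] + 160 * np.arange(98)[:, None]
spec = np.fft.rfft(tone[idx] * np.hamming(400), n=512, axis=1)
rebuilt = reconstruct_waveform(np.log(np.abs(spec) ** 2 + 1e-12), np.angle(spec))
```

In the round trip the LPS features and the phase come from the same signal, so the tone is recovered almost exactly. In enhancement, the LPS features come from the network while the phase is still taken from the noisy speech.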