Speech enhancement by LSTM-based noise suppression followed by CNN-based speech restoration

Single-channel speech enhancement in highly non-stationary noise conditions is a very challenging task, especially when interfering speech is included in the noise. Deep learning-based approaches have notably improved the performance of speech enhancement algorithms under such conditions, but still introduce speech distortions if strong noise suppression shall be achieved. We propose to address this problem by using a two-stage approach, first performing noise suppression and subsequently restoring natural sounding speech, using specifically chosen neural network topologies and loss functions for each task. A mask-based long short-term memory (LSTM) network is employed for noise suppression and speech restoration is performed via spectral mapping with a convolutional encoder-decoder network (CED). The proposed method improves speech quality (PESQ) over state-of-the-art single-stage methods by about 0.1 points for unseen highly non-stationary noise types including interfering speech. Furthermore, it is able to increase intelligibility in low-SNR conditions and consistently outperforms all reference methods.


Introduction
Speech enhancement is the task of removing interferences from a degraded speech signal and thereby improving the perceived quality and intelligibility of the signal. The research interest in speech enhancement has been consistently high, due to challenges arising with applications such as mobile speech communication systems, hearing aids, and robust speech recognition. This paper focuses on the challenging task of single-channel speech enhancement in non-stationary noise conditions including interfering speech, reflecting the real-world conditions in many of the aforementioned applications.
Classical speech enhancement algorithms typically operate in the short-time Fourier transform (STFT) domain and use a frequency bin-wise gain function, also called weighting rule, which is derived using an optimality criterion under specific model assumptions for the distributions of speech and/or noise [1][2][3][4]. Commonly, estimates of the a (2020) 2020: 49 Page 2 of 26 priori signal-to-noise ratio (SNR) and in turn the noise power are needed for the weighting rule computation. Numerous algorithms exist for the estimation of a priori SNR [1,5,6] and noise power [7][8][9], where the latter ones are generally based on the assumption that noise in a given analysis segment is more stationary than speech [10,11]. This assumption does not hold for highly non-stationary noise types such as speech babble or restaurant noise and therefore classical speech enhancement algorithms often fail to provide good performance in such conditions. With the advent of deep learning, an increasing number of studies using deep neural networks (DNNs) for speech enhancement have shown that these models are able to significantly outperform classical and other machine learning-based methods in terms of speech quality and intelligibility [12][13][14][15][16][17][18][19][20][21]. This is especially true for non-stationary noise conditions, where deep learning-based methods have the advantage of making no assumptions on the stationarity of noise or the underlying distributions of speech and noise. Important aspects of these methods are on the one hand the feature and target representations as well as the loss function used in training and on the other hand the topology of the neural network. We focus on research addressing each of those aspects in the next two paragraphs.
In [13] and [14], the authors use a feedforward DNN to directly map from noisy log-spectral features to the corresponding clean speech features and show that a good generalization to unseen noise types can be achieved by multi-condition training with a large number of different noise types [14]. A comparison of various target representations for supervised DNN-based speech separation has been conducted in [15] and concludes that estimating bounded time-frequency (T-F) ratio masks such as the ideal ratio mask (IRM) is advantageous compared to directly estimating the clean spectrogram. Various types of T-F masks have been further investigated in [16] and [22], where the authors introduce a masked spectrum approximation (MSA) loss that optimizes the mask estimation task in the domain of speech spectra as opposed to using a mask approximation (MA) loss with ideal masks as optimization targets. In addition, a phase-sensitive spectrum approximation (PSA) loss, which takes the phase difference between noisy and clean speech signals into account while still estimating real-valued masks, is introduced, showing advantages over other mask-based targets [22]. A way to fully integrate the joint estimation of clean speech spectral amplitude and phase into mask-based systems is the usage of complex ratio mask (cRM) targets, which perfectly reconstruct the clean signal under ideal estimation conditions ([17], with early predecessors for speech quality testing [23,24]). A potential drawback of this method is that it uses an MA loss and therefore does not leverage the advantages of optimization in the speech spectral domain.
The models used in these early studies have mostly been feedforward DNNs [12][13][14][15][17], although Weninger et al. [16] have shown that long short-term memory (LSTM) networks, with their ability to model temporal dynamics, have benefits in a speech separation task. An important advantage of LSTMs in comparison to feedforward DNNs is their ability to focus on a target speaker, taking into account long-term temporal dependencies, and therefore suppressing interfering speech better, as well as providing a better speaker generalization [25]. Recently, a third type of model, namely convolutional neural networks (CNNs), has been the subject of an increasing number of studies in the field of speech enhancement and separation [20,26,27]. Many of the successful CNN model architectures are based on the convolutional encoder-decoder (CED) principle adopted from computer vision research [28,29]. As opposed to conventional CNN architectures that only compress the feature dimension by using pooling layers, the CED compresses in the encoder part and decompresses in the decoder part of the model by using upsampling layers or strided deconvolutions [30]. By adding skip connections from same-sized layers of the encoder to the decoder, high-resolution structural information can be preserved, which is especially important for a regression task such as speech enhancement, where a mapping from the noisy speech spectrum to a same-sized target clean speech spectrum has to be learned. Park et al. [27] demonstrate the effectiveness of different variations of CEDs and Takahashi et al. [20] introduce densely connected convolutional layers and multi-band processing into the architecture. A CED network has also been used by Zhao et al. to enhance encoded and subsequently decoded speech in a postprocessing step, showing remarkable generalization capabilities even to unseen codecs [18].
One way of leveraging the advantages of different network topologies is to combine them into a single model and train this combined model on the task at hand. A combination of CNN and bidirectional LSTM is shown to significantly outperform feedforward DNNs and recurrent neural networks (RNNs) [31] for speech enhancement, with the restriction of the introduced model not being capable of real-time processing. A model for real-time processing, which integrates LSTM layers in the bottleneck of a CED network is introduced in [32].
A second approach for the combination of models in speech enhancement is to employ multi-stage processing, where either multiple identical models (cf. [33]) or most often different models are used in succession to improve the enhancement performance. Applied to classical speech enhancement, this principle is generally used to achieve a higher noise attenuation, e.g., with the multi-stage Wiener filter approach [34], which in turn leads to degradations of the speech quality. Different from that, some studies have focused on first performing speech separation and subsequently enhancing the separated signals using nonnegative matrix factorization [35] or Gaussian mixture models [36]. In combination with deep learning models, the multi-stage paradigm has been applied to music source separation using feedforward DNNs for the separation task as well as the subsequent task of enhancing the separated signals [37]. A further possibility is proposed in [38], where denoising and dereverberation are addressed in subsequent stages using separately trained feedforward DNNs and joint fine-tuning of the two-stage model is carried out in a second step.
Most of the described deep learning models aim at a high noise attenuation and therefore can still degrade speech quality, especially for low SNRs and non-stationary noise types, or when iterative processing is employed. We propose to address this problem by first performing noise suppression and subsequently restoring natural sounding speech. In contrast to [37] and [38], we rely on specifically chosen DNN topologies with beneficial properties for each of the two tasks. An LSTM-based model with its ability to use long-term temporal context to distinguish between noise and speech is used for noise suppression. Inspired by its success for image restoration [28] and speech decoder postprocessing [18], we employ a CED network for speech restoration and residual noise suppression. We believe that this type of model is well-suited to perform a mapping from the input domain to an only slightly different target domain, as is the case with slightly distorted speech and undistorted clean speech. A further contribution is the reformulation of the MSA loss function for the joint estimation of real and imaginary parts of the clean speech spectrum, which is used with the LSTM-based model to aim at a high noise attenuation after the first processing stage. Finally, our work focuses on highly non-stationary noise types including interfering speech, which have often proven problematic for machine learning-based speech enhancement.
The paper is structured as follows: in Section 2, we introduce the speech enhancement framework used for both baselines and our proposed approaches based on deep learning. Next, a detailed description of our two-stage approach and the utilized DNN topologies is given in Section 3, followed by the experimental setup and network training details in Section 4. The evaluation results are presented in Section 5, and we conclude the work in Section 6.

Speech enhancement framework, classical and deep learning-based approaches
For the task of estimating the clean speech signal $s(n)$ from a noisy microphone signal $y(n)$, we employ a signal model of the form
$$y(n) = s(n) + d(n), \tag{1}$$
where the noise signal $d(n)$ with the discrete-time sample index $n$ is assumed to additively mix with $s(n)$. The corresponding STFT domain representation, computed by applying a frame-wise window function with a frame length of $L$ and a frame shift of $R$, followed by a $K$-point discrete Fourier transform (DFT), is given by
$$Y_\ell(k) = S_\ell(k) + D_\ell(k), \tag{2}$$
with frame index $\ell$ and frequency bin index $k \in \mathcal{K} = \{0, 1, \ldots, K-1\}$. Most of the classical speech enhancement approaches and also many deep learning-based approaches rely on estimating a frame- and frequency bin-wise gain function $G_\ell(k)$ to subsequently compute the estimated clean speech following
$$\hat{S}_\ell(k) = G_\ell(k) \cdot Y_\ell(k). \tag{3}$$
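The framework above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: the Hann window, the choice of a DFT size equal to the frame length, the toy signals, and the helper names `dft`, `stft`, and `apply_gain` are all assumptions made for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive K-point DFT of one frame (O(K^2), for illustration only)."""
    K = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / K) for n in range(K))
        for k in range(K)
    ]


def stft(signal, frame_len, frame_shift):
    """Window each frame and transform it; here the DFT size K equals frame_len."""
    # Periodic Hann window: an assumption, the text does not name the window.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectra.append(dft(frame))
    return spectra


def apply_gain(Y, G):
    """Multiply every bin of every frame by its gain to estimate the clean spectrum."""
    return [[g * y for g, y in zip(G_l, Y_l)] for G_l, Y_l in zip(G, Y)]


# Toy signals (hypothetical): a sinusoid as "speech", an alternating sequence as "noise".
s = [math.sin(0.3 * n) for n in range(32)]
d = [0.1 * (-1) ** n for n in range(32)]
y = [a + b for a, b in zip(s, d)]

Y, S, D = stft(y, 8, 4), stft(s, 8, 4), stft(d, 8, 4)

# The additive model carries over to the STFT domain because the STFT is linear.
max_dev = max(abs(Y[l][k] - (S[l][k] + D[l][k])) for l in range(len(Y)) for k in range(8))
```

Because windowing and the DFT are linear, `max_dev` stays at the level of floating-point rounding, which is why the additive time-domain model carries over to the STFT domain.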

Classical approaches
In classical speech enhancement approaches employing parametric statistical models of speech and noise, the computation of the gain function
$$G_\ell(k) = g\big(\xi_\ell(k), \gamma_\ell(k)\big) \tag{4}$$
typically depends on the a priori SNR $\xi_\ell(k)$ and the a posteriori SNR $\gamma_\ell(k)$. In this work, we consider $g(\cdot)$ to represent the well-known minimum mean-square error log-spectral amplitude (MMSE-LSA) estimator [2] or the super-Gaussian joint maximum a posteriori (SG-jMAP) estimator [4] and use the decision-directed (DD) approach [1] for the estimation of $\xi_\ell(k)$. Additionally, an estimate of the noise power is required for the computation of $\xi_\ell(k)$ and $\gamma_\ell(k)$ and can be obtained utilizing the minimum statistics (MS) approach [7].
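As a rough illustration of how a weighting rule is driven by the two SNR quantities, the following sketch implements a common textbook form of the decision-directed recursion. Two simplifications are made and should be read as assumptions: the Wiener gain $\xi/(1+\xi)$ stands in for the MMSE-LSA and SG-jMAP estimators, and a fixed, known noise power replaces the minimum statistics tracker. The smoothing constant `alpha` and the floor `xi_min` are illustrative values, not the parameters used in this work.

```python
def decision_directed_gains(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Per-frame, per-bin gains from decision-directed a priori SNR estimates.

    noisy_power: list of frames, each a list of |Y(k)|^2 values.
    noise_power: list with one (fixed, assumed known) noise power per bin.
    """
    prev_clean_power = [0.0] * len(noise_power)
    gains = []
    for frame in noisy_power:
        frame_gains = []
        for k, (p_y, p_d) in enumerate(zip(frame, noise_power)):
            gamma = p_y / p_d  # a posteriori SNR
            # Weighted sum of the previous clean-speech estimate and the
            # instantaneous (maximum-likelihood) a priori SNR estimate.
            xi = alpha * prev_clean_power[k] / p_d + (1.0 - alpha) * max(gamma - 1.0, 0.0)
            xi = max(xi, xi_min)
            g = xi / (1.0 + xi)  # Wiener gain, a stand-in for g(.)
            prev_clean_power[k] = (g ** 2) * p_y
            frame_gains.append(g)
        gains.append(frame_gains)
    return gains


# Bin 0 carries strong speech (20 dB above the noise), bin 1 carries noise only.
gains = decision_directed_gains([[100.0, 1.0]] * 20, [1.0, 1.0])
```

In this toy case the gain of the speech-dominated bin rises towards one within a few frames, while the noise-only bin stays at the floor.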

Deep learning-based approaches
Deep learning-based approaches use neural network (NN) models that are trained beforehand on a set of training data to perform the speech enhancement task. In general, they can be described as a mapping from an input feature vector $\mathbf{x}_\ell$ to the output vector
$$\mathbf{u}_\ell = f(\mathbf{x}_\ell, \mathbf{h}_{\ell-1}; \Theta), \tag{5}$$
based on the non-linear composite function $f(\cdot)$ defined by the network topology, and the trainable parameters $\Theta$. The additional input of network hidden states $\mathbf{h}_{\ell-1}$ for the preceding frame is used to model temporal context in recurrent neural networks, e.g., LSTMs.
In the context of deep learning-based approaches, $G_\ell(k)$ from (3) is often referred to as a T-F mask separating clean speech and noise. The NN model can be trained in a supervised fashion to estimate these masks by minimizing the mask approximation (MA) loss function
$$J^{\mathrm{MA}}_\ell = \frac{1}{K} \sum_{k \in \mathcal{K}} \big(\hat{G}_\ell(k) - G^{\mathrm{ideal}}_\ell(k)\big)^2, \tag{6}$$
where $G^{\mathrm{ideal}}_\ell(k) \in \mathbb{R}$ are the ideal mask values representing the training targets and $\hat{G}_\ell(k)$ are the estimated mask values at the network output. In this case, the network output vector is composed as $\mathbf{u}_\ell = [\hat{G}_\ell(0), \hat{G}_\ell(1), \ldots, \hat{G}_\ell(K/2)]^T$, which can be obtained by reducing the summation in (6) to the elements $k \in \{0, 1, \ldots, K/2\}$ only, while halving the contribution at $k = 0$ and $k = K/2$. A well-established choice for the ideal mask target is the IRM
$$G^{\mathrm{IRM}}_\ell(k) = \left( \frac{|S_\ell(k)|^2}{|S_\ell(k)|^2 + |D_\ell(k)|^2} \right)^{1/\beta} \tag{7}$$
with the common parameter choice of $\beta = 2$ [15,25], making it formally comparable to the square-root Wiener filter gain function.
The MA loss function does not directly optimize the objective of minimizing the difference between estimated speech spectrum $\hat{S}_\ell(k)$ and clean speech spectrum $S_\ell(k)$. In fact, the contribution of the estimates to the loss for each frequency bin is subject to a ratio of $S_\ell(k)$ and $D_\ell(k)$ and not directly to the energy distribution of $S_\ell(k)$ and $Y_\ell(k)$. This can lead to, e.g., the MA loss taking on high values for bins $k$ where both $S_\ell(k)$ and $Y_\ell(k)$ are close to zero and therefore no contribution to the loss should be considered regardless of the estimated mask value $\hat{G}_\ell(k)$. Direct optimization of the aforementioned objective, which also preserves the benefits of estimating a mask value that can be restricted to a certain value range, can be accomplished by using the masked spectrum approximation (MSA) loss function [16]
$$J^{\mathrm{MSA}}_\ell = \frac{1}{K} \sum_{k \in \mathcal{K}} \big(\hat{G}_\ell(k) \cdot |Y_\ell(k)| - |S_\ell(k)|\big)^2. \tag{8}$$
Up to this point, the loss functions presented in (6) and (8) only operate on spectral magnitudes, which leads to $\hat{G}_\ell(k) \in \mathbb{R}$ and in turn, following (3), the usage of the noisy phase for the enhanced speech $\hat{S}_\ell(k)$.
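The difference between the two losses for near-silent bins can be made concrete with a small numerical sketch. It assumes that the IRM takes the form $(|S|^2 / (|S|^2 + |D|^2))^{1/\beta}$ and evaluates only the per-bin squared errors, leaving out any normalization. The magnitudes below are hypothetical, and $|Y| = |S| + |D|$ is used purely for illustration (it holds only when speech and noise are in phase).

```python
def irm(S_mag, D_mag, beta=2.0):
    """Ideal ratio mask per bin (assumed form, see lead-in)."""
    return [(s * s / (s * s + d * d)) ** (1.0 / beta) for s, d in zip(S_mag, D_mag)]


def ma_terms(G_hat, G_ideal):
    """Per-bin squared errors in the mask domain (MA loss terms)."""
    return [(g_hat - g_id) ** 2 for g_hat, g_id in zip(G_hat, G_ideal)]


def msa_terms(G_hat, Y_mag, S_mag):
    """Per-bin squared errors in the speech spectral domain (MSA loss terms)."""
    return [(g_hat * y - s) ** 2 for g_hat, y, s in zip(G_hat, Y_mag, S_mag)]


# Bin 0: speech and noise of equal magnitude. Bin 1: almost silent in both.
S_mag = [1.0, 1e-4]
D_mag = [1.0, 1e-6]
Y_mag = [s + d for s, d in zip(S_mag, D_mag)]  # in-phase assumption
G_hat = [0.7, 0.0]  # hypothetical network output

ma = ma_terms(G_hat, irm(S_mag, D_mag))
msa = msa_terms(G_hat, Y_mag, S_mag)
```

For the near-silent bin, the MA term is close to one although the mask value there has no audible consequence, whereas the MSA term is negligible.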
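One way of estimating the clean speech phase, which has proven beneficial for deep learning-based speech enhancement, is to use an ideal complex mask $G^{\mathrm{ICM}}_\ell(k) = S_\ell(k)/Y_\ell(k)$ as target and separately estimate real and imaginary part of $G^{\mathrm{ICM}}_\ell(k)$ using an MA loss for training [17]. In contrast, we propose an alternative loss function that combines the advantages of clean speech phase estimation and the MSA loss paradigm of optimizing in the speech spectral domain. Such a complex MSA (cMSA) loss can be formulated, e.g., as
$$J^{\mathrm{cMSA}}_\ell = \frac{1}{K} \sum_{k \in \mathcal{K}} \Big[ \big(\hat{G}^R_\ell(k) \cdot \Re\{Y_\ell(k)\} - \Re\{S_\ell(k)\}\big)^2 + \big(\hat{G}^I_\ell(k) \cdot \Im\{Y_\ell(k)\} - \Im\{S_\ell(k)\}\big)^2 \Big], \tag{9}$$
where separate real-valued masks $\hat{G}^R_\ell(k)$ and $\hat{G}^I_\ell(k)$ are used to estimate the real and imaginary part of $S_\ell(k)$, respectively. Here, $\Re\{\cdot\}$ delivers the real part, and $\Im\{\cdot\}$ the imaginary part of the argument. Applying the cMSA loss, the neural network output vector contains the estimates of both masks, and the enhanced signal is computed according to
$$\hat{S}_\ell(k) = \hat{G}^R_\ell(k) \cdot \Re\{Y_\ell(k)\} + j \, \hat{G}^I_\ell(k) \cdot \Im\{Y_\ell(k)\}. \tag{10}$$
A third possibility, which we will call complex spectrum approximation (cSA), is to directly estimate real and imaginary parts of the clean speech spectrum $S_\ell(k)$ following
$$J^{\mathrm{cSA}}_\ell = \frac{1}{K} \sum_{k \in \mathcal{K}} \Big[ \big(\hat{S}^R_\ell(k) - \Re\{S_\ell(k)\}\big)^2 + \big(\hat{S}^I_\ell(k) - \Im\{S_\ell(k)\}\big)^2 \Big], \tag{11}$$
where $\hat{S}^R_\ell(k)$ and $\hat{S}^I_\ell(k)$ are the estimated real and imaginary parts, respectively.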
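A short sketch may help to see why two separate real-valued masks allow the phase of the noisy spectrum to be modified. The function names, the averaging over bins, and the toy spectra are assumptions made for illustration.

```python
def apply_complex_masks(Y, G_R, G_I):
    """Mask real and imaginary parts of the noisy spectrum separately."""
    return [complex(g_r * y.real, g_i * y.imag) for y, g_r, g_i in zip(Y, G_R, G_I)]


def csa_loss(S_hat, S):
    """Squared error on real and imaginary parts, averaged over bins."""
    return sum((sh.real - s.real) ** 2 + (sh.imag - s.imag) ** 2
               for sh, s in zip(S_hat, S)) / len(S)


def cmsa_loss(Y, S, G_R, G_I):
    """cMSA: the cSA error of the masked noisy spectrum."""
    return csa_loss(apply_complex_masks(Y, G_R, G_I), S)


# Toy spectra (hypothetical) whose clean and noisy phases differ.
Y = [2 + 2j, 1 - 1j]
S = [1 - 1j, 0.5 + 0.5j]

# Masks within [-1, 1] that reproduce S exactly: a negative imaginary-part
# mask flips the sign of the imaginary part and thereby changes the phase.
loss_complex = cmsa_loss(Y, S, G_R=[0.5, 0.5], G_I=[-0.5, -0.5])

# A single real-valued magnitude mask |S|/|Y| = 0.5 keeps the noisy phase.
loss_magnitude_only = csa_loss([0.5 * y for y in Y], S)
```

In this example the pair of masks reaches zero error, while the best magnitude mask leaves a residual error that is entirely due to the retained noisy phase.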

New LSTM-based noise suppression followed by CNN-based speech restoration
The underlying idea of our newly proposed system is to employ separate processing stages for speech denoising and restoration, both using deep NN topologies with advantageous properties for the respective tasks. In the noise suppression stage, an LSTM-based network trained with the cMSA loss from (9) is employed to attain a strong noise attenuation, even at the cost of potentially introducing speech distortions. The subsequent restoration stage restores speech and further attenuates residual noise. For this second task a CED network is used, which has been found to be very well-suited for the restoration of slightly corrupted structured signals, e.g., in image restoration [28] or enhancement of coded speech [18]. The CED network training employs the cSA loss function defined in (11) and therefore a direct spectral mapping is performed in the second stage. The cSA loss function is chosen over a mask-based loss for two reasons: First, the restoration of missing T-F regions in the estimated signal can be quite difficult for a mask-based approach, requiring very large mask values. Second, the CED network is specifically designed to map to outputs of the same domain as the input, in this case speech spectra rather than spectral masks. In the following, an overview of the system is given and the chosen network topologies for both stages are described in detail.

System description
The overall processing scheme of our two-stage approach is depicted in Fig. 1. At first, the STFT representation of the noisy speech $Y_\ell(k)$ is input to the noise suppression stage. A feature extraction including mean and variance normalization (MVN) is performed to obtain the normalized feature vector $\tilde{\mathbf{x}}^{(1)}_\ell$, where the MVN is carried out using vectors of means $\boldsymbol{\mu}^{(1)}_x$ and standard deviations $\boldsymbol{\sigma}^{(1)}_x$ obtained during network training. The feature extraction also includes concatenating $L^-$ frames of past and $L^+$ frames of future context to the features extracted for the current frame, but for stricter latency requirements the system can also work with $L^+ = 0$, i.e., no lookahead at all. Based on the input features $\tilde{\mathbf{x}}^{(1)}_\ell$ and the network parameters $\Theta^{(1)}$ obtained during training, the noise suppression network estimates separate real-valued masks $\hat{G}^R_\ell(k)$ and $\hat{G}^I_\ell(k)$ for the real and imaginary part of the noisy speech spectrum $Y_\ell(k)$. In the subsequent masking block, these masks are applied following (10) and the estimated denoised speech spectrum $\hat{S}^{(1)}_\ell(k)$ is obtained.
The interpolation block in between the two stages increases the frequency resolution for processing in the speech restoration stage to enable the employed CED network to fully leverage its potential of mapping to a high-resolution estimated spectrum and in turn restoring spectral details of the clean speech signal. The interpolation is realized through applying a $K$-point inverse DFT (IDFT) followed by zero-padding in the time domain and subsequent transformation back to the frequency domain via a $K'$-point DFT ($K' > K$), resulting in the interpolated denoised speech spectrum $\hat{S}^{(1)}_\ell(k')$.
In the speech restoration stage, a second feature extraction, including MVN using the vectors of means $\boldsymbol{\mu}^{(2)}_x$ and standard deviations $\boldsymbol{\sigma}^{(2)}_x$ obtained during speech restoration network training, is employed. The resulting feature representation $\tilde{\mathbf{x}}^{(2)}_\ell$ is input to the speech restoration network, which directly maps to the enhanced speech spectrum $\hat{S}^{(2)}_\ell(k')$, using the trained network parameters $\Theta^{(2)}$. Reconstruction of the corresponding enhanced time-domain signal $\hat{s}^{(2)}(n)$ is subsequently realized through IDFT, synthesis windowing, and overlap-add (OLA).
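The interpolation step between the two stages can be sketched for a single frame as follows. The naive DFT, the toy frame, and the sizes 8 and 16 are assumptions chosen to keep the example small; they are not the sizes used in this work.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive DFT or inverse DFT (O(K^2), for illustration only)."""
    K = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / K) for n in range(K))
        for k in range(K)
    ]
    return [v / K for v in out] if inverse else out


def interpolate_spectrum(S, K_prime):
    """IDFT, zero-padding in the time domain, then a K_prime-point DFT."""
    if K_prime <= len(S):
        raise ValueError("K_prime must exceed the original DFT size")
    s = dft(S, inverse=True)
    s_padded = s + [0j] * (K_prime - len(s))
    return dft(s_padded)


# Toy frame (hypothetical) and its 8-point spectrum.
S = dft([math.sin(0.7 * n) + 0.2 * n for n in range(8)])
S_interp = interpolate_spectrum(S, 16)
```

When the DFT size is doubled, every second bin of the interpolated spectrum coincides with an original bin, and the bins in between are interpolated values. No new information is created; the spectrum is only sampled on a finer frequency grid.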

First-stage noise suppression network topology
The noise suppression network relies on the LSTM-based topology depicted in Fig. 2, where FF denotes fully connected feedforward layers and the sizes of the feature representations for each layer of the network are shown before and after the respective layers. The input feature vector $\tilde{\mathbf{x}}^{(1)}_\ell$ is composed of MVN-normalized spectral magnitudes using only the non-redundant DFT bins, which results in a feature vector size of $C = (L^- + 1 + L^+) \cdot (K/2 + 1)$. The employed network uses a single FF layer upfront, which can help to learn a good feature representation for the temporal modeling in the two following LSTM layers [25,42]. Two additional FF layers lead to the output layer estimating the T-F masks for noise suppression. All of the FF layers are composed of 425 nodes and use rectified linear unit (ReLU) activations [43] with the exception of the output layer, which has a number of nodes corresponding to the DFT size $K$ and uses a tanh activation to restrict the estimated masks to $\hat{G}^R_\ell(k), \hat{G}^I_\ell(k) \in [-1, 1]$. Such a restriction of mask values has been found to ease optimization and to decrease the ideally achievable estimation accuracy only marginally [17,22]. The LSTM layers use the standard implementation as introduced in [44] without peephole connections, and also consist of 425 nodes.
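The construction of the first-stage input vector can be sketched as follows. The handling of the first and last frames (repeating the edge frame), the toy DFT size of 8, and the context sizes are assumptions, since the text does not specify them.

```python
def mvn(features, means, stds):
    """Mean and variance normalization with statistics obtained in training."""
    return [(x - m) / s for x, m, s in zip(features, means, stds)]


def stack_context(frames, l, n_past, n_future):
    """Concatenate n_past past frames, frame l, and n_future future frames."""
    stacked = []
    for offset in range(-n_past, n_future + 1):
        # Repeat the first or last frame at the signal edges (assumption).
        idx = min(max(l + offset, 0), len(frames) - 1)
        stacked.extend(frames[idx])
    return stacked


K = 8  # toy DFT size; K/2 + 1 = 5 non-redundant bins per frame
n_bins = K // 2 + 1
frames = [[float(10 * l + k) for k in range(n_bins)] for l in range(6)]

n_past, n_future = 2, 1
x = stack_context(frames, 0, n_past, n_future)
x_norm = mvn(x, means=[0.0] * len(x), stds=[1.0] * len(x))
```

The length of `x` equals `(n_past + 1 + n_future) * (K // 2 + 1)`, i.e., 20 in this toy setting. Setting `n_future = 0` removes the lookahead.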

Second-stage speech restoration network topology
The network topology we deem best-suited for the speech restoration stage is the CED network, of which two different architectural setups are depicted in Figs. 3 and 4. The sizes of the feature representations for each layer are once again shown before and after each layer, where the first two sizes refer to the frequency and frame axis, respectively, and the third size refers to the number of feature maps. The input features for the CED are constructed from the complex denoised spectrum $\hat{S}^{(1)}_\ell(k')$ according to (12), using separate feature maps for the real and imaginary parts and applying MVN to obtain the normalized feature representation $\tilde{\mathbf{x}}^{(2)}_\ell$, with $(\cdot)^T$ being the transpose. The way these feature maps are constructed makes sure that the convolutional [...] (12) is used to obtain a frequency axis size $M$ that is a multiple of four, which is necessary for a total dimension reduction by a factor of four in the encoder and subsequent reconstruction of equally sized output features in the decoder. The output structure corresponds to that of the input features, when replacing $\hat{S}^{(1)}_\ell(k')$ in (12) with the final clean speech spectrum estimates $\hat{S}^{(2)}_\ell(k')$. The features $\tilde{\mathbf{x}}^{(2)}_\ell$ do not use a frame lookahead or any other information from future frames, which results in the speech restoration stage not adding any additional algorithmic delay. The actual network topology is inspired by the CED network from [18] and comprises several building blocks. Convolutional layers are denoted by Conv($F$, $N \times 1$), with $F$ determining the number of filter kernels, $N$ being the kernel size on the frequency axis and the kernel size on the frame axis being specified as one, resulting in one-dimensional convolutions. Transposed convolutions [45] are denoted correspondingly by Conv$^T$($F$, $N \times 1$).
All convolutional and transposed convolutional layers use the leaky ReLU activation function [46], zero-padding to ensure a consistent size of the output feature maps with respect to the input, and a stride of one, except where indicated by a /2 beside the blocks, denoting a stride of two. The encoder part of the CED uses either maximum pooling (Fig. 3) or convolutions with a stride of two (Fig. 4), each reducing the size of the features with respect to the frequency axis by a factor of two. As a counterpart in the decoder, either upsampling layers (Fig. 3) or transposed convolutions with a stride of two (Fig. 4) are employed, leading to a doubling of the frequency axis size. We indicate these different setups by du and tr, respectively, where the tr-setup can significantly reduce the computational complexity with respect to the number of multiplications needed [47]. In the bottleneck between encoder and decoder, we also employ a convolutional layer and use a total number of two skip connections from encoder to decoder at points of matching feature map dimensions.
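The requirement that the frequency axis size be a multiple of four follows directly from the two halving steps in the encoder and the two doubling steps in the decoder. The helper below is a hypothetical sketch that only tracks the frequency axis size; it does not model the convolutions themselves.

```python
def ced_frequency_sizes(M, num_down=2):
    """Frequency axis sizes through the encoder and the decoder."""
    if M % (2 ** num_down) != 0:
        raise ValueError("M must be a multiple of %d" % 2 ** num_down)
    encoder = [M]
    for _ in range(num_down):
        encoder.append(encoder[-1] // 2)  # pooling or stride-two convolution
    decoder = [encoder[-1]]
    for _ in range(num_down):
        decoder.append(decoder[-1] * 2)  # upsampling or transposed convolution
    return encoder, decoder


encoder, decoder = ced_frequency_sizes(260)
# Skip connections are possible wherever encoder and decoder sizes match
# outside the bottleneck.
skip_sizes = [m for m in encoder[:-1] if m in decoder[1:]]
```

For a frequency axis size of 260, the sizes run 260, 130, 65 in the encoder and 65, 130, 260 in the decoder, which leaves exactly two matching sizes for skip connections. A size of 257, i.e., only the non-redundant bins of a 512-point DFT, would not survive the second halving and is rejected by the check.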

Databases and preprocessing
For the training and evaluation of our proposed system, we use clean speech data from the TIMIT [48] and NTT super wideband [49] databases (British and American English only for NTT), both downsampled to 8 kHz. We merge both databases to one large set containing a total amount of 7.5 h of speech and construct distinct training, development, and test sets by using 60%, 20%, and 20% of the total data, respectively. We make sure that there is no overlap in speakers between the distinct sets and the amount of female and male speakers is balanced. The clean speech data is mixed with cuts of three different café noises (noise file durations of 34:00, 39:23, and 42:02 minutes) from the QUT noise database [50] and the babble and restaurant noise (with durations of 3:55 and 4:46 min, respectively) from the AURORA-2 database [51]. We deliberately choose only highly non-stationary noise types including interfering speech to evaluate the performance of our system under challenging conditions. The noisy data is generated by defining three distinct parts of each noise file for training, development, and test (spanning 60%, 20% and 20% of the noise file duration, respectively) and mixing random cuts of the respective parts with the clean speech data. For the training set, each speech file is mixed with a cut of each of the 5 noise files, applying SNRs of 0, 5 and 10 dB, resulting in a total of 5 · 3 = 15 training conditions, corresponding to a total amount of 67.5 h of training material. The development and test data is constructed accordingly, but additionally using SNRs of −5 and 15 dB unseen in training. Please note that these additional SNR conditions are only used for reporting of results and are removed from the development set for parameter optimization and loss monitoring during training. 
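The mixing step and the stated amount of training material can be checked with a short sketch. The SNR is computed here from the average power over the whole signal, which is an assumption: the text does not state how speech and noise levels are measured. The toy signals and the function name `mix_at_snr` are likewise hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR in dB."""
    p_s = sum(x * x for x in speech) / len(speech)
    p_n = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    noisy = [s + scale * n for s, n in zip(speech, noise)]
    return noisy, scale


speech = [math.sin(0.2 * n) for n in range(400)]
noise = [0.5 * math.cos(1.3 * n) for n in range(400)]

achieved = {}
for snr_db in (0, 5, 10):  # training SNRs
    _, scale = mix_at_snr(speech, noise, snr_db)
    p_s = sum(x * x for x in speech) / len(speech)
    p_n = sum((scale * x) ** 2 for x in noise) / len(noise)
    achieved[snr_db] = 10.0 * math.log10(p_s / p_n)

# Amount of training material: 60% of 7.5 h, mixed under 5 * 3 conditions.
training_hours = 7.5 * 0.6 * 5 * 3
```

The last line reproduces the 67.5 h of training material: 4.5 h of clean training speech under 15 conditions.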
Furthermore, a separate test set employing the pub and office call center noises from the ETSI database [52] as unseen noise files is constructed and used to evaluate the generalization properties of the tested systems. Please note that the office call center noise contains interfering speech as well as non-stationary non-speech sounds such as clatter or typing, whereas pub noise mainly contains interfering speech. To further evaluate the generalization properties to non-stationary noise types without interfering speech, we include an additional separate evaluation using the traffic noise from the ETSI database [52].
[...] 512. Thus, the input signal representations for second-stage network training match the output of the interpolation block (see Fig. 1) during two-stage processing, resulting in a second-stage feature size of $M = K'/2 + 4 = 260$ on the frequency axis.

Training of the LSTM-based noise suppression
Training of the LSTM-based network for noise suppression is conducted using the backpropagation through time (BPTT) algorithm [53] in combination with the cMSA loss function (9) and the Adam optimizer [54]. We use an initial learning rate of μ = 0.001, a batch size of 25, and set the remaining parameters for Adam according to the recommendations in [54]. To prevent the network from overfitting, an L2 weight-decay of 0.0002 is employed. The BPTT training is carried out in a truncated fashion with fixed-length sequences of size 100 being extracted from the training set utterances. Any remaining shorter sequences are zero-padded to match this size. The contributions of padded sequence parts are set to zero for the computation of gradients in BPTT. During training, we monitor the development set loss and halve the learning rate once it does not decrease for more than three epochs, restarting training using this new learning rate from the epoch with currently lowest development set loss. Training is stopped when a minimal learning rate of μ min = 0.0001 is reached.
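The extraction of fixed-length sequences with zero-padding and the exclusion of padded frames from the loss can be sketched as follows. Normalizing the loss by the number of valid frames and the helper names are assumptions; the text only states that padded parts do not contribute to the gradients.

```python
def chunk_sequence(frames, chunk_len=100, pad_value=0.0):
    """Split an utterance into fixed-length chunks and matching 0/1 masks."""
    chunks, masks = [], []
    for start in range(0, len(frames), chunk_len):
        chunk = list(frames[start:start + chunk_len])
        n_valid = len(chunk)
        n_pad = chunk_len - n_valid
        chunks.append(chunk + [pad_value] * n_pad)
        masks.append([1.0] * n_valid + [0.0] * n_pad)
    return chunks, masks


def masked_mean_loss(frame_losses, mask):
    """Average frame losses over valid frames only (padded frames weigh zero)."""
    total = sum(loss * m for loss, m in zip(frame_losses, mask))
    return total / max(sum(mask), 1.0)


# Hypothetical utterance of 250 frames, one scalar per frame for brevity.
utterance = [1.0] * 250
chunks, masks = chunk_sequence(utterance)

# Padded frames may produce arbitrary loss values; the mask removes them.
frame_losses = [2.0] * 50 + [99.0] * 50
last_chunk_loss = masked_mean_loss(frame_losses, masks[-1])
```

The utterance yields three chunks, the last of which holds 50 valid and 50 padded frames, and the masked loss of that chunk is unaffected by the values at the padded positions.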

Training of the CNN-based speech restoration
The CNN-based CED networks for both the du- and tr-setup are trained using standard backpropagation [55] employing the cSA loss function (11). The Adam optimizer with an initial learning rate of μ = 0.0001, a batch size of 16, and otherwise the same parameter settings as in LSTM training is utilized. During CED training, the learning rate is multiplied by a factor of 0.6 and network training is resumed from the epoch with best development set loss, if the development set loss does not decrease for more than two epochs. The training is once again stopped when a minimal learning rate of μ min = 0.00001 is reached. The parameters defining the network topology in terms of number of filter kernels and the kernel size on the frequency axis are chosen as F = 88 and N = 24, respectively. These parameter values have been obtained from the optimal values found in [18] for a similar CED network by keeping proportions with regard to the input feature size fixed.
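The learning rate control used for both networks can be sketched as a small simulation over a given sequence of development set losses. Two points are interpretations rather than facts from the text: the learning rate is reduced as soon as the number of epochs without improvement exceeds the patience, and training stops once a reduction would push the learning rate below the minimum. Restoring the parameters of the best epoch is only indicated by a comment, since no model is trained here.

```python
def plateau_schedule(dev_losses, lr_init, factor, patience, lr_min):
    """Return the (epoch, learning rate) history and the best epoch."""
    lr = lr_init
    best_loss, best_epoch = float("inf"), None
    stale = 0  # epochs since the last improvement
    history = []
    for epoch, loss in enumerate(dev_losses):
        history.append((epoch, lr))
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
            continue
        stale += 1
        if stale > patience:
            lr *= factor
            stale = 0
            if lr < lr_min:
                break  # minimal learning rate reached: stop training
            # A real training loop would reload the parameters saved at
            # best_epoch here before continuing with the new learning rate.
    return history, best_epoch


# Worst case (hypothetical): the loss never improves after the first epoch.
losses = [1.0] + [2.0] * 40
lstm_history, _ = plateau_schedule(losses, 1e-3, 0.5, patience=3, lr_min=1e-4)
ced_history, _ = plateau_schedule(losses, 1e-4, 0.6, patience=2, lr_min=1e-5)

lstm_rates = sorted({lr for _, lr in lstm_history}, reverse=True)
ced_rates = sorted({lr for _, lr in ced_history}, reverse=True)
```

Under this interpretation, the LSTM setting passes through four learning rates (0.001 down to 0.000125) and the CED setting through five (0.0001 down to about 0.000013) before training stops.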

Baseline and proposed methods
We compare our new approach against several baseline methods, first considering the classical MMSE-LSA and SG-jMAP weighting rules using the DD a priori SNR estimator and the MS noise power estimator, as described in Section 2.1. These classical approaches use optimal parameters adopted from [56]. Furthermore, LSTM-based baseline methods using an MA loss (6) with IRM targets (7) and β = 2, or alternatively the MSA loss function (8) are considered and referred to as LSTM-IRM and LSTM-MSA, respectively. Both use LSTM topologies comparable to the proposed noise suppression network (LSTM-cMSA), with the only difference of employing a sigmoid output activation for the estimation of magnitude masks in the range [0, 1]. Note that the LSTM-IRM baseline is quite comparable to the approach used in [25], with only slight changes to the employed LSTM topology and the usage of spectral magnitude features for comparability to the proposed approach. As a further baseline, the proposed second-stage CED network (du-setup) is trained as a single-stage enhancement method using the cSA loss (11) and is referred to as CED-cSA-du. A first (novel) two-stage method dubbed LSTM-cMSA+DNN-cSA is investigated, consisting of the proposed LSTM-cMSA network in the first stage, followed by a feedforward DNN trained with the cSA loss (11), the same training scheme as described in Section 4.3, and also the same features and targets used for training of the finally proposed second-stage network (CED-cSA). The DNN-cSA second-stage network uses five hidden layers with 800 units each, resulting in a total number of parameters comparable to the CED-cSA network. Furthermore, we report results of our proposed novel model using the du- and tr-setup described in Section 3.3 for the second-stage network architecture (called LSTM-cMSA+CED-cSA-du and LSTM-cMSA+CED-cSA-tr, respectively).
We also experimented with an additional joint fine-tuning step after separate training of both stages, which did not yield significantly improved results compared to separate training alone. Furthermore, a joint training of both models from scratch has been evaluated, but did not converge. Therefore, neither joint training variant was considered further for the experimental evaluation. As an additional reference, we include the two-stage method pre-published in [39], which uses a convolutional LSTM layer [57] in between encoder and decoder of the second-stage network. A further difference to our proposed model is the usage of maximum pooling and upsampling layers instead of the computationally more efficient strided and transposed convolutions in the proposed CED-cSA-tr network. We call the method from [39] LSTM-cMSA+CLED-cSA-du and compare it to the proposed methods in terms of performance and computational complexity.

Instrumental quality measures
We choose to only employ instrumental measures (see footnote 5) operating on the enhanced speech ŝ(n), the noisy speech y(n), and the clean speech reference s(n). The signal-to-noise ratio improvement (SNRI) provided by the system under test is measured according to ITU-T G.160 [58]. Note that the G.160 Recommendations [58] do not include highly non-stationary noise conditions, but nonetheless, SNRI is regularly used for evaluation under such conditions [59][60][61]. Thus, we employ SNRI only as an indicator for the noise suppression capabilities of the respective system. Furthermore, we use perceptual evaluation of speech quality (PESQ) [62] to obtain a mean opinion score for listening quality objective (MOS-LQO), which is quite correlated with the overall speech quality perception of human listeners, although not perfectly suited for speech with (residual) noise. To assess the intelligibility of the enhanced speech, the short-time objective intelligibility (STOI) measure [63] is utilized. The STOI measure is specifically designed for [...]

Footnote 5: For quality evaluation of speech enhancement algorithms, it is often preferable to use a component-wise evaluation according to the so-called white-box [64] or black-box [23] approaches, to be able to investigate the effects on the speech component and the noise component of the noisy mixture separately. These approaches rely on the multiplication of the clean speech spectrum S(k) and the noise spectrum D(k) with the estimated gain function Ĝ(k) or an artificial complex-valued gain function computed from the enhanced speech Ŝ(k) and the noisy speech Y(k) to compute the filtered speech component S̃(k) and the filtered noise component D̃(k). Unfortunately, those component-wise approaches can be problematic when using phase-aware processing with Ĝ(k) = |Ĝ(k)| · e^(j∠Ĝ(k)) ∈ ℂ, where the multiplication with Ĝ(k) includes applying the phase factor e^(j∠Ĝ(k)) to S(k) and D(k). This can lead to artifacts in the reconstructed filtered components s̃(n) and d̃(n) after IDFT and OLA, which would have been suppressed through the combination of phase terms in Ŝ(k) = Ĝ(k)(S(k) + D(k)) and subsequent reconstruction of ŝ(n) via IDFT and OLA.
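The linearity behind the component-wise evaluation discussed above can be illustrated with a minimal Python sketch. The three-bin toy spectra, the gain values, and the helper name apply_gain are illustrative assumptions and are not taken from the evaluation setup. The sketch only shows that the filtered components sum to the enhanced spectrum, while each component on its own carries the phase rotation of the complex gain.

```python
import cmath


def apply_gain(gain, spectrum):
    """Multiply each frequency bin of a spectrum by the matching complex gain."""
    return [g * x for g, x in zip(gain, spectrum)]


# Toy single-frame spectra with three frequency bins (hypothetical values).
S = [1.0 + 0.5j, 0.2 - 0.3j, -0.4 + 0.1j]  # clean speech spectrum
D = [0.3 - 0.2j, 0.6 + 0.4j, 0.1 + 0.7j]   # noise spectrum
Y = [s + d for s, d in zip(S, D)]          # noisy mixture

# Phase-aware gain per bin, built from a magnitude and a phase angle.
G_hat = [cmath.rect(0.9, 0.2), cmath.rect(0.3, -0.8), cmath.rect(0.5, 1.1)]

S_tilde = apply_gain(G_hat, S)  # filtered speech component
D_tilde = apply_gain(G_hat, D)  # filtered noise component
S_hat = apply_gain(G_hat, Y)    # enhanced spectrum

# The two filtered components add up to the enhanced spectrum ...
residual = max(abs(a + b - c) for a, b, c in zip(S_tilde, D_tilde, S_hat))

# ... but the speech component alone is rotated by the phase of the gain.
phase_shift = [cmath.phase(st / s) for st, s in zip(S_tilde, S)]

print("max reconstruction residual:", residual)
print("phase rotation applied to speech bins:", phase_shift)
```

In this toy setting, the phase rotation is harmless for the sum but remains visible in each filtered component, which is the effect that makes the component-wise measures hard to interpret for phase-aware processing.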

Results and discussion
In the following, we discuss the results of the experiments conducted with seen and unseen noise types and subsequently analyze our proposed approach to give further explanations for the performance improvements we observe compared to the baseline methods.

Results on seen noise types
The results for the development set data using noise types that were seen during training are presented in Table 1 (see footnote 6). On average, but also for each single SNR condition, the deep learning-based methods substantially outperform the classical MMSE-LSA and SG-jMAP in terms of PESQ, STOI, and SNRI. Most notably, the classical methods are not able to improve the intelligibility in terms of STOI compared to the unprocessed noisy speech, which has an average STOI value of 0.75. In contrast, the deep learning-based methods improve on that value by up to 0.13 points (0.88) averaged over the SNR conditions. This observation is in line with the results of earlier studies [14,17], which also report higher intelligibility improvements for deep learning-based methods, especially in low-SNR conditions. Furthermore, MMSE-LSA and SG-jMAP only slightly improve the overall quality in terms of PESQ over unprocessed noisy speech for the very challenging −5 dB condition, whereas the deep learning-based methods are able to significantly improve PESQ, although not having seen a comparably low SNR during training.
Comparing the single-stage baselines LSTM-IRM and LSTM-MSA, we observe consistent superiority of LSTM-MSA in terms of PESQ and SNRI with average improvements of 0.08 MOS points and 1.52 dB, respectively, confirming the advantage of optimization in the speech spectral domain as opposed to the mask domain. The proposed LSTM-cMSA noise suppression network, employed without second-stage processing, can further improve PESQ by 0.02 MOS points and SNRI by an impressive 5.22 dB (23.11 dB compared to 17.89 dB for LSTM-MSA), averaged over all SNR conditions. For the low-SNR conditions −5 and 0 dB though, LSTM-cMSA provides lower PESQ values than LSTM-MSA. This could be due to the fact that LSTM-cMSA is implicitly estimating the clean phase or at least incorporates phase information by using the real and imaginary parts of the clean speech spectrum S(k) as targets. Leveraging this information is potentially very difficult for low SNRs where noise can be dominant in the mixture and therefore conceal relevant information on phase inherent to the magnitude features, e.g., the location of harmonics. Nonetheless, LSTM-cMSA provides considerably higher noise suppression than all other single-stage methods in terms of SNRI, even in low-SNR conditions. This is in line with the original idea of aiming at a very high noise suppression in the first stage and allowing some speech distortions, which in turn can be restored by the second stage. Due to this observation in combination with providing the best average performance on the development set, we choose to employ LSTM-cMSA as the first processing stage for all experiments including second-stage processing. Employing CED-cSA-du as a single-stage enhancement network leads to a deterioration compared to LSTM-cMSA of PESQ by 0.04 points and SNRI by 2.60 dB averaged over all SNRs, while STOI remains comparable. For the 0 and 5 dB SNR conditions, CED-cSA-du performs worst of all deep learning-based methods in terms of PESQ and worse than LSTM-MSA and LSTM-cMSA in terms of SNRI. However, the performance of CED-cSA-du compared to the LSTM-based methods improves for high-SNR conditions, even providing the best performance among single-stage methods in terms of all measures for 15 dB SNR. This shows that CED-cSA-du is well-suited for high input SNRs, which supports its usage as a second-stage network, where noise suppression has already been applied in the first stage.
The proposed two-stage method (for both du- and tr-setup of LSTM-cMSA+CED-cSA) improves all employed instrumental measures for all SNR conditions compared to only using LSTM-cMSA and all except SNRI for the 15 dB SNR condition compared to only using CED-cSA, providing a notable average PESQ improvement of 0.11 MOS points over the best single-stage method. Even higher PESQ improvements of up to 0.18 MOS points with respect to the best performing single-stage methods can be obtained for the 5 and 10 dB SNR conditions. Comparison with the other investigated two-stage method (LSTM-cMSA+DNN-cSA) shows significantly higher PESQ when using the CED-cSA as second stage, while being comparable or even slightly worse in terms of SNRI. Our interpretation of this observation is that the CED-cSA network, while providing comparable additional noise suppression, is better suited to restore missing or degraded parts of speech and therefore provides better overall speech quality in terms of PESQ. Although yielding only second-best performance in terms of PESQ for the difficult −5 dB SNR condition, the second-stage processing with CED-cSA can still slightly improve on using LSTM-cMSA only. Concerning the intelligibility in terms of STOI, the LSTM-based single-stage methods roughly provide the same performance, but using the LSTM-cMSA+CED-cSA methods further improves STOI for all SNR conditions. They provide gains of up to 0.05 points for the lower SNR conditions, where improving the intelligibility is very relevant. Direct comparison of the different second-stage setups LSTM-cMSA+CED-cSA-du and LSTM-cMSA+CED-cSA-tr shows comparable or slightly better performance of PESQ and improved noise suppression in terms of SNRI for the tr-setup. Comparing the results on the development set with the results obtained on the test set depicted in Table 2, the same conclusions on performance trends and model ranking for all three measures are obtained from the evaluation of both sets. The overall performance on the test set is slightly worse for all models including the classical methods, which do not rely on the development set for parameter tuning. This shows that the test set is slightly more difficult to process for different types of speech enhancement methods and the deep learning-based approaches generalize well to the test set data. Average results of the best proposed method LSTM-cMSA+CED-cSA-tr are only very slightly worse in terms of PESQ and STOI, but even slightly improved in terms of SNRI with respect to the high-complexity reference LSTM-cMSA+CLED-cSA-du.

Table 2 Instrumental measures for baseline and proposed approaches averaged over seen noise types of the test set data. Note that the SNR conditions of −5 and 15 dB are unseen during training, whereas the remaining SNRs have been seen. Best two approaches are in boldface, high-complexity reference LSTM-cMSA+CLED-cSA-du excluded

Results on unseen noise types
The results obtained from evaluating the unseen noise test dataset are presented in Table 3, where results are averaged over both noise types (pub and office noise). Once again, similar trends and model rankings compared to the evaluation with seen noise types can be observed, which shows a good generalization of the deep learning-based methods to these highly non-stationary unseen noise types in general. In particular, the two-stage LSTM-cMSA+CED-cSA-tr network is able to provide improvements over LSTM-cMSA comparable to the ones obtained with seen noise types (0.11 MOS points, 0.03, and 2.40 dB in terms of PESQ, STOI, and SNRI, respectively). The comparison of deep learning-based methods for unseen pub and office noise in 5 dB SNR, depicted in Fig. 5, shows that all single-stage methods perform notably better in office noise according to PESQ. However, pub noise, which contains mostly interfering speech, seems to be quite difficult for the single-stage methods. This difference in overall quality can be mitigated to some extent by the usage of the proposed second stage (LSTM-cMSA+CED-cSA-tr), which improves PESQ in pub noise by an impressive 0.17 MOS points with regard to LSTM-cMSA. The additional analysis for unseen traffic noise is also depicted in Fig. 5 and shows that the evaluated methods also generalize well to a non-stationary noise type not containing interfering speech (whereas model training was focused on noise types including interfering speech). The proposed two-stage network LSTM-cMSA+CED-cSA-tr is able to improve on using only LSTM-cMSA by 0.11 MOS points, whereas the two-stage reference network LSTM-cMSA+DNN-cSA does not provide an improvement in speech quality in terms of PESQ. For all three evaluated noise types, using the tr- over the du-setup of our proposed system improves all three quality measures.

Model complexity
An analysis of the computational complexity of our proposed LSTM-cMSA+CED-cSA setups and the high-complexity reference LSTM-cMSA+CLED-cSA-du [39] in terms of number of parameters, multiplications, and the average time needed to process one frame (L = 256 samples), resulting in a real-time factor, is presented in Table 4. The frame processing time was measured on an Intel Core i5 machine clocked at 3.4 GHz using our TensorFlow implementation without any further optimizations to speed up inference and relying exclusively on CPU processing. With an average total frame processing time of 10.8 ms while using a frame shift of 16 ms (R = 128), the proposed approach using the tr-setup can be processed in real-time (real-time factor 0.68) without employing GPU processing. The du-setup increases the number of multiplications, which leads to a real-time factor of 1.14, while showing even slightly worse performance compared to the tr-setup (cf. Table 3). The usage of a convolutional LSTM layer and maximum pooling and upsampling operations instead of strided and transposed convolutions in the LSTM-cMSA+CLED-cSA-du reference adds complexity in terms of number of parameters and multiplications, leading to a further increased real-time factor of 1.85. In combination with the overall comparable performance that LSTM-cMSA+CLED-cSA-du offers with respect to our newly proposed method (slight improvements in PESQ, but [...] Table 3), we conclude that a recurrent model structure is not needed for the second stage of our two-stage system, so that model complexity can be reduced drastically while performance is preserved.
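The real-time factor quoted above is the ratio of the average frame processing time to the frame shift. The following minimal Python sketch reproduces this arithmetic for the tr-setup. The function names are illustrative, and the sampling rate of 8 kHz is not stated in this section but is an assumption implied by a frame shift of R = 128 samples lasting 16 ms.

```python
def frame_shift_ms(shift_samples, sample_rate_hz):
    """Duration of one frame shift in milliseconds."""
    return 1000.0 * shift_samples / sample_rate_hz


def real_time_factor(processing_ms, shift_ms):
    """Processing time per frame divided by the time between frames."""
    return processing_ms / shift_ms


# Assumption: 8 kHz sampling, implied by R = 128 samples per 16 ms.
SAMPLE_RATE_HZ = 8000

shift = frame_shift_ms(128, SAMPLE_RATE_HZ)  # frame shift R = 128 samples
rtf_tr = real_time_factor(10.8, shift)       # 10.8 ms per frame (tr-setup)
is_real_time = rtf_tr < 1.0                  # below 1 means real-time capable

print(f"frame shift: {shift} ms, real-time factor: {rtf_tr:.3f}")
```

By the same definition, the reported factors of 1.14 and 1.85 correspond to roughly 18.2 ms and 29.6 ms of processing per 16 ms frame shift, which exceeds the real-time budget on the CPU used for the measurement.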

Analysis of the two-stage approach
To further analyze the reasons for the observed quality improvements with our proposed two-stage approach, the enhanced speech spectrograms obtained with the deep learning-based methods are compared, using an exemplary test set utterance in pub noise at 5 dB SNR. The spectrograms of clean speech s(n), noisy speech y(n), and enhanced speech ŝ(n) for the different methods are shown in Fig. 6. Comparing the output of the two single-stage methods LSTM-MSA and LSTM-cMSA (third and fourth spectrogram from the top, respectively) shows the higher noise suppression that can be obtained with LSTM-cMSA. This comes at the cost of suppressing some parts of the speech signal as well, which can be examined in the highlighted areas in the respective spectrograms. Proceeding to the outputs after second-stage processing with DNN-cSA (third from bottom) and CED-cSA-tr (second from bottom), it can be observed that certain previously missing or distorted parts are restored (again highlighted in the respective spectrograms). Furthermore, CED-cSA-tr is able to more accurately restore the harmonic details of the original clean speech compared to DNN-cSA, as can be seen, e.g., in the rightmost highlighted region. We attribute this to the CED topology, which, as opposed to a fully connected topology, puts a focus on local dependencies over frequency through the use of convolutional kernels and is able to process different frequency regions with shared parameters, which we believe to be especially advantageous for the reconstruction of harmonic structures. Moreover, the CED is able to use high-resolution information on the clean speech inherent to the noisy features directly via its skip connections, which can also aid a more detailed reconstruction. The comparison of the proposed LSTM-cMSA+CED-cSA-tr network with the high-complexity reference LSTM-cMSA+CLED-cSA-du furthermore shows that comparable speech restoration and noise suppression capabilities can be achieved with our newly proposed method, while employing significantly fewer model parameters and computational resources (see footnote 7).

Table 4 Comparison of complexity in terms of number of trainable parameters, multiplications and real-time factor (measured on an Intel Core i5 machine clocked at 3.4 GHz) for our proposed methods and the high-complexity reference pre-published in [39]. Comparisons are given for the second stage, for which the methods differ, and for the total two-stage system
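The argument about shared convolutional parameters over frequency can be made concrete with a small, purely illustrative Python sketch. It is not the CED network: the 32-bin toy frame, the 3-tap kernel, and the helper names conv1d_same and harmonic_frame are assumptions chosen for the illustration. The sketch shows that one small kernel responds identically to a harmonic pattern wherever it occurs along the frequency axis, whereas a fully connected layer would need a separate weight for every pair of input and output bins.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1-D correlation along frequency (odd kernel length)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def harmonic_frame(num_bins, first_bin, spacing):
    """Toy magnitude frame with unit peaks at evenly spaced 'harmonics'."""
    return [
        1.0 if i >= first_bin and (i - first_bin) % spacing == 0 else 0.0
        for i in range(num_bins)
    ]


NUM_BINS = 32
kernel = [0.25, 0.5, 0.25]  # hypothetical 3-tap kernel, shared by all bins

low = harmonic_frame(NUM_BINS, first_bin=4, spacing=6)
high = harmonic_frame(NUM_BINS, first_bin=6, spacing=6)  # shifted by 2 bins

out_low = conv1d_same(low, kernel)
out_high = conv1d_same(high, kernel)

# Shifting the input pattern by two bins shifts the output by two bins.
shift_preserved = all(
    out_high[i + 2] == out_low[i] for i in range(NUM_BINS - 2)
)

# Weight counts (biases ignored): shared kernel vs. dense bin-to-bin mapping.
conv_params = len(kernel)
dense_params = NUM_BINS * NUM_BINS

print("shift preserved:", shift_preserved)
print("conv weights:", conv_params, "dense weights:", dense_params)
```

The same kernel therefore serves all frequency regions, which is one plausible reading of why a convolutional topology may reconstruct regularly spaced harmonics more readily than a fully connected one.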

Conclusion
In this paper, we have proposed a new two-stage approach for speech enhancement, using specifically chosen network topologies for the consecutive tasks of noise suppression and restoration of natural sounding speech. The first stage consists of an LSTM network estimating T-F masks for real and imaginary parts of the noisy speech spectrum, while the second stage performs spectral mapping using a convolutional encoder-decoder (CED) network. Employing only the noise suppression stage trained with the complex masked spectrum approximation (cMSA) loss, we observe an impressive gain of more than 5 dB in SNR improvement (SNRI) compared to the baselines, but only slight or no gains in terms of overall quality (PESQ). When employing both stages, average improvements of PESQ by about 0.1 MOS points can be obtained in unseen highly non-stationary noises including interfering speech. Furthermore, our approach also improves STOI in low-SNR conditions compared to the baselines.