Integrated acoustic echo and background noise suppression technique based on soft decision

In this paper, we propose an efficient integrated acoustic echo and noise suppression algorithm that uses the combined power of the acoustic echo and background noise within a soft decision framework. The combined power is incorporated into the soft decision-based suppression algorithm to address artifacts introduced by conventional methods, such as nonlinear distortion and disturbed noise estimates. Specifically, in the unified frequency-domain architecture, the acoustic echo and noise are suppressed together by the soft decision-based acoustic echo suppression algorithm, without an additional noise reduction technique.


Introduction
Recently, hands-free systems have become widely used in mobile communication for safety and convenience. However, such equipment introduces specific technical difficulties due to background noise and the echoes caused by acoustic coupling between its loudspeaker and microphone [1,2]. Thus, for hands-free mobile equipment, the serial combination of an acoustic echo cancellation (AEC) algorithm and a noise reduction (NR) algorithm has been predominantly considered to achieve sufficient quality of the transmitted speech signal [3,4]. The performance of such a conventional integrated system is significantly affected by how the AEC and NR algorithms are combined. Generally, in the structure where the NR module follows the AEC algorithm, noise estimation can be disturbed by the AEC processing. Conversely, in the structure where the NR algorithm precedes the AEC algorithm, the NR introduces nonlinear distortions on the echo signal, which can disturb the identification operation [5]. Therefore, much work has been dedicated to improving the performance of the combined structure of AEC and NR algorithms. In [6], Gustafsson et al. used a single perceptually motivated weighting rule to suppress both noise and residual echo in the frequency domain. However, this method needs an adaptive echo canceller to identify the echo path impulse response in order to eliminate the undesired echo, which also affects the performance of the NR algorithm. In [7], Habets et al. presented a joint suppression technique for stationary (e.g., background noise) and non-stationary (e.g., echo) interference using a soft decision approach. However, an estimate of the variance of the echo signal was assumed to be known a priori, which inherently requires the AEC before the NR module.
Another closely related technique by the same authors is the combined suppression of residual echo, reverberation, and background noise in the form of a post-filter following the traditional AEC [8]. However, in [7,8] the cancellation is performed directly on the waveform, so the algorithm is sensitive to misalignment in the echo path response estimate. Also, it is hard to efficiently model impulse responses that last many milliseconds and require hundreds of coefficients. From this viewpoint, it is notable that the low-complexity acoustic echo suppression (AES) algorithm by Faller [9] uses a spectral modification technique that incorporates a filter characterizing the actual echo path response in the frequency domain. More recently, our previous work [10] presented an AES algorithm based on soft decision that requires neither the AEC nor the additional residual echo suppression (RES) that conventional methods need. However, this technique does not take the background noise into consideration for suppression, which is not realistic.
In this paper, we propose a novel integrated suppression algorithm in which the combined power of the acoustic echo and background noise is incorporated into the soft decision framework of [10] to directly suppress both strong acoustic echo and noise in the frequency domain. The proposed method estimates the echo power and the noise power separately and sums them to provide a unified framework for determining and modifying the suppression gain based on soft decision. This is clearly different from the conventional integrated strategies, which require the AEC and NR independently. To this end, our approach directly estimates the spectral envelope of the echo signal instead of identifying the echo path impulse response in the time domain. The background noise is estimated during periods in which both near-end speech and echo are absent. In particular, the acoustic echo and noise are reduced simultaneously through a single soft decision-based gain that uses the estimated combined power. As a result, the proposed method can efficiently suppress the acoustic echo and noise without an additional residual signal suppressor. Accordingly, the proposed unified structure addresses the problems associated with the residual echo and noise produced by the conventional structures, in which the NR operation is placed after the AEC algorithm or vice versa. The performance of the proposed algorithm is evaluated by both subjective and objective quality tests and is shown to be better than that of the conventional methods.

Proposed integrated suppression algorithm based on soft decision
As noted in the previous section, the AES technique in [10] needs an additional NR stage before or after the AES architecture to suppress noise. However, this arrangement has the same drawbacks as the conventional integrated system [5], namely nonlinear distortion of the echo or a disturbed noise power estimate. When the NR operation is placed after the AES algorithm, the noise power estimation can be disturbed by the AES processing. Conversely, when the NR algorithm is simply placed before the AES, it introduces nonlinear distortions on the echo signal, which can disturb the identification operation. To reduce the problems resulting from the serially combined structure, we propose an integrated suppression system based on the combined power of the acoustic echo and background noise; Figure 1 shows the block diagram of the proposed system. As the figure indicates, the proposed method suppresses the acoustic echo and the noise signal with a single gain based on soft decision. For this, the noise and echo spectra are estimated separately and then combined into a single power in the soft decision framework. Since we take the frequency-domain AES algorithm in [10] as a baseline, we restate the two hypotheses $H_0$ and $H_1$, which indicate near-end speech absence and presence, so that they incorporate the discrete Fourier transform (DFT) spectrum of the noise signal $D(i,k)$:

$$
\begin{aligned}
H_0 &: Y(i,k) = D(i,k) + E(i,k) \\
H_1 &: Y(i,k) = D(i,k) + E(i,k) + S(i,k)
\end{aligned} \tag{1}
$$

where $E(i,k)$, $S(i,k)$, and $Y(i,k)$ represent the DFT spectra of the echo signal, the near-end speech, and the input signal picked up by the microphone, with time index $i$ and frequency index $k$.
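Under the assumption that $D(i,k)$, $E(i,k)$, and $S(i,k)$ are characterized by separate zero-mean complex Gaussian distributions, the following are obtained [10]:

$$
p(Y(i,k) \mid H_0) = \frac{1}{\pi\left[\lambda_e(i,k) + \lambda_d(i,k)\right]} \exp\left\{ -\frac{|Y(i,k)|^2}{\lambda_e(i,k) + \lambda_d(i,k)} \right\} \tag{2}
$$

$$
p(Y(i,k) \mid H_1) = \frac{1}{\pi\left[\lambda_e(i,k) + \lambda_d(i,k) + \lambda_s(i,k)\right]} \exp\left\{ -\frac{|Y(i,k)|^2}{\lambda_e(i,k) + \lambda_d(i,k) + \lambda_s(i,k)} \right\} \tag{3}
$$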
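where $\lambda_e(i,k)$, $\lambda_d(i,k)$, and $\lambda_s(i,k)$ are the variances of the echo, noise, and near-end speech, respectively. The near-end speech absence probability (NSAP) $p(H_0 \mid Y(i,k))$ for each frequency band is derived from Bayes' rule such that [10]:

$$
p(H_0 \mid Y(i,k)) = \frac{p(Y(i,k) \mid H_0)\,p(H_0)}{p(Y(i,k) \mid H_0)\,p(H_0) + p(Y(i,k) \mid H_1)\,p(H_1)} = \frac{1}{1 + q\,\Lambda(Y(i,k))} \tag{4}
$$

where $q = p(H_1)/p(H_0)$ and $p(H_0)\,(= 1 - p(H_1))$ is the a priori probability of near-end speech absence. Substituting (2) and (3) into (4), the likelihood ratio $\Lambda(Y(i,k))$ can be computed as follows:

$$
\Lambda(Y(i,k)) = \frac{p(Y(i,k) \mid H_1)}{p(Y(i,k) \mid H_0)} = \frac{1}{1 + \xi(i,k)} \exp\left[ \frac{\gamma(i,k)\,\xi(i,k)}{1 + \xi(i,k)} \right] \tag{5}
$$

For (5), we define the a posteriori signal-to-combined power ratio (SCR) $\gamma(i,k)$ and the a priori SCR $\xi(i,k)$ by

$$
\gamma(i,k) \equiv \frac{|Y(i,k)|^2}{\lambda_e(i,k) + \lambda_d(i,k)} = \frac{|Y(i,k)|^2}{\lambda_{cb}(i,k)}, \qquad \xi(i,k) \equiv \frac{\lambda_s(i,k)}{\lambda_e(i,k) + \lambda_d(i,k)} = \frac{\lambda_s(i,k)}{\lambda_{cb}(i,k)} \tag{6}
$$

where $\lambda_{cb}(i,k)$ denotes the combined power of the echo and noise that are to be suppressed simultaneously, which should be estimated carefully. The a priori SCR $\xi(i,k)$ is estimated with the well-known decision-directed (DD) approach [10]:

$$
\hat{\xi}(i,k) = \alpha_{DD}\,\frac{|\hat{S}(i-1,k)|^2}{\hat{\lambda}_{cb}(i-1,k)} + (1 - \alpha_{DD})\,P[\gamma(i,k) - 1] \tag{7}
$$

where $\alpha_{DD}$ is a weight, and $P[z] = z$ if $z \ge 0$ and $P[z] = 0$ otherwise. Also, $\hat{S}(i-1,k)$ is the estimate of the near-end speech at the $k$th frequency in the previous frame, and $\hat{\lambda}_{cb}(i,k)$ is the estimate of $\lambda_{cb}(i,k)$.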
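The soft decision quantities described above can be sketched numerically. The following Python fragment is a minimal illustration for a single frequency bin, assuming the standard Gaussian-model forms $\Lambda = \exp[\gamma\xi/(1+\xi)]/(1+\xi)$ and $p(H_0 \mid Y) = 1/(1 + q\Lambda)$ together with the decision-directed update. The function names, the default weight of 0.98, and the exponent cap are assumptions made for illustration and are not taken from the paper.

```python
import math


def likelihood_ratio(gamma: float, xi: float) -> float:
    """Likelihood ratio Lambda for one bin under the complex Gaussian model."""
    # Cap the exponent (an illustrative safeguard) to avoid overflow for large gamma.
    exponent = min(gamma * xi / (1.0 + xi), 50.0)
    return math.exp(exponent) / (1.0 + xi)


def nsap(gamma: float, xi: float, q: float) -> float:
    """Near-end speech absence probability p(H0|Y) = 1 / (1 + q * Lambda)."""
    return 1.0 / (1.0 + q * likelihood_ratio(gamma, xi))


def dd_a_priori_scr(prev_speech_power: float, prev_combined_power: float,
                    gamma: float, alpha_dd: float = 0.98) -> float:
    """Decision-directed estimate of the a priori SCR xi(i, k).

    prev_speech_power is |S_hat(i-1, k)|^2 and prev_combined_power is the
    combined echo-plus-noise power estimate from the previous frame.
    alpha_dd = 0.98 is a hypothetical default, not a value from the paper.
    """
    return (alpha_dd * prev_speech_power / prev_combined_power
            + (1.0 - alpha_dd) * max(gamma - 1.0, 0.0))
```

A large a posteriori SCR drives the absence probability toward zero, while a bin with little energy relative to the combined power keeps it high, which is the behavior the soft decision gain relies on.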
For $\hat{\lambda}_{cb}(i,k)$, we first estimate the power of the echo signal when the near-end speech signal is not present in the observation (single-talk), as given by

$$
\hat{\lambda}_e(i,k) = \alpha_{\lambda_e}\,\hat{\lambda}_e(i-1,k) + (1 - \alpha_{\lambda_e})\,|\hat{E}(i,k)|^2 \tag{8}
$$

where $\alpha_{\lambda_e}$ is a smoothing parameter. Note that noise is not taken into account in this update scheme, since it is assumed that the echo is not correlated with the noise and that the power of the echo signal dominates the noise power. The estimated magnitude spectrum of the echo, $|\hat{E}(i,k)|$, is given by

$$
|\hat{E}(i,k)| = H(i,k)\,|X_d(i,k)| \tag{9}
$$

with the far-end speech signal $X_d(i,k)$ and the gain filter $H(i,k)$ characterizing the response of the echo path, which is obtained as the magnitude of the least squares estimator [9]

$$
H(i,k) = \left| \frac{\mathrm{E}[X_d^*(i,k)\,Y(i,k)]}{\mathrm{E}[X_d^*(i,k)\,X_d(i,k)]} \right| \tag{10}
$$

where $*$ denotes the complex conjugate and the subscript $d$ indicates a delay of $d$ samples. Since the echo path is time varying, $H(i,k)$ is estimated iteratively as in [10]. Note that, since $Y(i,k)$ is not affected by the NR algorithm, the estimate of the echo path response does not suffer from nonlinear distortion caused by the NR operation. The update of the estimate $H(i,k)$ should be frozen during double-talk periods to prevent its divergence. To detect a double-talk period, the cross-correlation coefficient-based double-talk detection method proposed in [4] is implemented in the frequency domain. More specifically, (1) the cross-correlation coefficient between the microphone input and the estimated echo, and (2) the cross-correlation coefficient between the microphone input and the residual error of the suppressor are computed and used to detect double-talk periods in each frame.
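The echo power estimation described above can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions: the expectations in the least squares gain are approximated by first-order recursive averages with a hypothetical forgetting factor `beta`, the double-talk and single-talk decisions are passed in as boolean flags rather than computed by the detector of [4], and all function names and default parameter values are illustrative rather than taken from the paper.

```python
def update_echo_path_gain(cross_xy: complex, auto_xx: float,
                          x_d: complex, y: complex,
                          double_talk: bool, beta: float = 0.9):
    """Update running estimates of E[X_d^* Y] and E[X_d^* X_d], then return H.

    The recursive averaging with `beta` is an assumed way to realize the
    iterative estimate; the update is frozen during double-talk.
    """
    if not double_talk:
        cross_xy = beta * cross_xy + (1.0 - beta) * x_d.conjugate() * y
        auto_xx = beta * auto_xx + (1.0 - beta) * abs(x_d) ** 2
    # A small constant guards against division by zero for silent far-end bins.
    gain = abs(cross_xy) / (auto_xx + 1e-12)
    return cross_xy, auto_xx, gain


def update_echo_power(prev_echo_power: float, x_d: complex, gain: float,
                      single_talk: bool, alpha_e: float = 0.9) -> float:
    """Smooth the echo power estimate; hold it when near-end speech is present."""
    if not single_talk:
        return prev_echo_power
    echo_magnitude = gain * abs(x_d)  # |E_hat| = H * |X_d|
    return alpha_e * prev_echo_power + (1.0 - alpha_e) * echo_magnitude ** 2
```

Passing the detection results as flags keeps the sketch independent of any particular double-talk detector.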
Based on the estimated echo power, we propose the combined power, which incorporates both the echo power and the background noise power. This is clearly different from the previous approach in [10], which does not estimate or include the background noise power because of the difficulty of estimating the noise power after the AES algorithm, as explained in the first paragraph of Section 2. Specifically, the combined power $\lambda_{cb}(i,k)$ is estimated by assuming that the acoustic echo and noise are uncorrelated and then combining the estimated echo and noise powers through a long-term smoothing scheme with a parameter $\alpha_{\lambda_{cb}}$ such that

$$
\hat{\lambda}_{cb}(i,k) = \alpha_{\lambda_{cb}}\,\hat{\lambda}_{cb}(i-1,k) + (1 - \alpha_{\lambda_{cb}})\left\{ \hat{\lambda}_e(i,k) + \mathrm{E}\left[\,|D(i,k)|^2 \mid Y(i,k)\right] \right\} \tag{11}
$$

where $\hat{\lambda}_e(i,k)$ is derived as in (8). Notice that if $\mathrm{E}[\,|D(i,k)|^2 \mid Y(i,k)] \cong 0$, (11) reduces to the original AES algorithm in [10], while (11) results in the conventional NR algorithm when $\hat{\lambda}_e(i,k)$ is nearly zero. The noise power estimate $\mathrm{E}[\,|D(i,k)|^2 \mid Y(i,k)]$ is obtained during noise-only periods, which are identified by a voice activity detection (VAD) algorithm similar to that of the IS-127 noise reduction algorithm, known to give robust performance under various noise conditions [11]. For this reason, we can avoid the disturbed estimate of the noise power incurred by the AES algorithm. Note that since both the echo $e(t)$ and the near-end speech $s(t)$ act as dominant speech signals, an additional VAD is needed at the near end to detect the noise-only periods. In addition, the proposed integrated algorithm is further improved in that distinct values of $q$ in (4) are estimated for different frames and frequency bins, i.e., $q(i,k)$, which can be tracked in time [12]. Therefore, the proposed algorithm employs a decision rule to decide whether the near-end speech signal is present in the $k$th bin, as given by (12), in which the smoothing parameter $\alpha_q$ is set to 0.3 and $I(i,k)$ denotes an indicator function for the result in (6), that is, $I(i,k) = 1$ if $\eta(i,k) > \eta_{th}$ and $I(i,k) = 0$ otherwise.
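The combination of the echo and noise powers described above can be sketched for one frequency bin as follows. This is a minimal illustration, assuming a first-order recursive average for both the noise power and the combined power. The noise-only decision is passed in as a boolean flag standing in for the VAD, and the function names and default smoothing values are hypothetical, not taken from the paper.

```python
def update_noise_power(prev_noise_power: float, y: complex,
                       noise_only: bool, alpha_d: float = 0.95) -> float:
    """Update the noise power only in frames flagged as noise-only by a VAD."""
    if not noise_only:
        return prev_noise_power  # hold the estimate while echo or speech is active
    return alpha_d * prev_noise_power + (1.0 - alpha_d) * abs(y) ** 2


def update_combined_power(prev_combined_power: float, echo_power: float,
                          noise_power: float, alpha_cb: float = 0.9) -> float:
    """Long-term smoothing of the sum of the echo and noise power estimates."""
    return (alpha_cb * prev_combined_power
            + (1.0 - alpha_cb) * (echo_power + noise_power))
```

With `noise_power` set to zero the update tracks the echo power alone, and with `echo_power` set to zero it tracks the noise power alone, mirroring the two limiting cases noted in the text.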
The value of $q(i,k)$ can be easily updated using $\eta(i,k)$. Finally, the estimate of the near-end speech $\hat{S}(i,k)$, with the echo and noise suppressed, can be expressed as

$$
\hat{S}(i,k) = \left(1 - p(H_0 \mid Y(i,k))\right) G(i,k)\,Y(i,k) = \tilde{G}(i,k)\,Y(i,k) \tag{13}
$$

where $p(H_0 \mid Y(i,k))$, $G(i,k)$, and $\tilde{G}(i,k)$ are the NSAP in (4), the suppression gain, and the overall suppression gain for the integrated system, respectively. Here, $G(i,k)$ for each frequency band is derived from the Wiener filter such that

$$
G(i,k) = \frac{\xi(i,k)}{1 + \xi(i,k)} \tag{14}
$$

Notice that the factor $(1 - p(H_0 \mid Y(i,k)))$ in $\tilde{G}(i,k)$ yields a better echo and noise suppression rule: it applies higher attenuation to bins consisting of echo or noise (or both) alone, while preserving the quality of the near-end speech.
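The overall suppression rule can be sketched for one frequency bin as follows. This is a minimal illustration, assuming the Wiener gain $\xi/(1+\xi)$ weighted by the near-end speech presence probability $1 - p(H_0 \mid Y)$, with $p(H_0 \mid Y) = 1/(1 + q\Lambda)$ under the Gaussian model. The function names and the exponent cap are assumptions for illustration only.

```python
import math


def overall_gain(gamma: float, xi: float, q: float) -> float:
    """Overall gain: Wiener gain scaled by the near-end speech presence probability."""
    wiener_gain = xi / (1.0 + xi)
    # Likelihood ratio under the Gaussian model; the cap avoids overflow.
    exponent = min(gamma * xi / (1.0 + xi), 50.0)
    likelihood = math.exp(exponent) / (1.0 + xi)
    p_absence = 1.0 / (1.0 + q * likelihood)
    return (1.0 - p_absence) * wiener_gain


def suppress_bin(y: complex, gamma: float, xi: float, q: float) -> complex:
    """Apply the overall gain to one microphone DFT coefficient Y(i, k)."""
    return overall_gain(gamma, xi, q) * y
```

Because the presence probability never exceeds one, the overall gain never exceeds the plain Wiener gain, so bins judged to contain only echo or noise are attenuated more strongly.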

Experiments and results
To compare the performance of the proposed integrated algorithm with that of the conventional methods, we conducted a quantitative comparison and a subjective quality test under various noise conditions. Twenty test phrases, spoken by seven speakers and sampled at 8 kHz, were used as the experimental data. To assess the performance of the proposed method, we artificially created 20 data files, each obtained by mixing the far-end signal with the near-end signal. Each frame of the windowed signal was transformed into its corresponding spectrum through a 128-point DFT after zero padding. We then formed 16 frequency sub-bands covering the full frequency range (up to 4 kHz) of the narrowband speech signal, analogous to the IS-127 noise suppression algorithm [11]. The far-end speech signal was convolved with a filter simulating the acoustic echo path before being mixed [13,14]. The simulation environment was designed to fit a small office room of size 5 × 4 × 3 m³. The simulated acoustic impulse response had 1,400 taps, with a reverberation time of $T_{60}$ = 0.14 s. The echo level measured at the input microphone was on average 3.5 dB lower than that of the input near-end speech. To create noisy conditions, white, babble, and vehicular noises from the NOISEX-92 database were added to the clean near-end speech signals at signal-to-noise ratios (SNRs) of 5, 10, 15, and 20 dB. For an objective comparison, we evaluated the proposed scheme and the conventional integrated algorithms in terms of echo return loss enhancement (ERLE) and speech attenuation (SA), which are defined in [13].
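Two steps of this setup, adding noise at a target SNR and measuring ERLE, can be sketched as follows. This is a minimal illustration that assumes the common definition of ERLE as the ratio in dB of the echo power at the microphone to the residual echo power after suppression. The exact definition used in the experiments is the one given in [13], which may differ, and the function names are hypothetical.

```python
import math


def mean_power(signal):
    """Average power of a real-valued sample sequence."""
    return sum(v * v for v in signal) / len(signal)


def scale_noise_to_snr(speech, noise, snr_db: float):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`."""
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [gain * n for n in noise]


def erle_db(mic_echo, residual_echo) -> float:
    """ERLE in dB, assuming the common echo-in over echo-out power ratio."""
    return 10.0 * math.log10(mean_power(mic_echo) / mean_power(residual_echo))
```

For example, `scale_noise_to_snr(speech, noise, 10.0)` returns a scaled copy of `noise` that can be added sample by sample to `speech` to produce a 10 dB SNR mixture.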
For comparison, we also evaluated the conventional acoustic echo and noise suppression algorithm by Gustafsson et al. [3] (see endnote a), which is a serial algorithm based on a time-domain AEC and an additional noise and residual echo reduction filter. We also included another integrated system in which the NR algorithm, that is, the IS-127 noise suppression [11], is followed by the AEC with the post-filter as in [15]. For the AEC, a normalized least mean square (NLMS) adaptive filter with L = 128 filter taps was used, chosen to match the DFT size (i.e., 128) used in our AES approach in terms of computational complexity. For the given noise environments, the overall results for the aforementioned 20 data files are shown in Figure 2. The ERLE and SA scores were averaged to yield final mean scores for the three types of noise sources. From Figure 2a, it is evident that in most noisy conditions the proposed integrated algorithm based on soft decision yielded a higher ERLE than the conventional techniques. This means that the proposed method effectively suppresses both the acoustic echo and the noise signal. The SAs of the proposed method during double-talk periods are shown in Figure 2b, where we can observe that the SAs of the proposed scheme were better than those of the methods by Gustafsson et al. and Turbin et al. in all the tested conditions. This indicates that the proposed algorithm preserves the near-end speech well during double-talk periods. The speech spectrograms are presented in Figure 3. As Figure 3e shows, the proposed method further reduces the residual echo and background noise compared to the conventional techniques (Figures 3c and 3d) during active far-end speech and noise periods, while preserving the near-end speech quite well.
In addition, Figure 4 illustrates speech segments produced by the proposed algorithm. A close look at the double-talk periods shows that the enhanced output signal is obtained successfully even during double-talk.
Finally, to evaluate the subjective quality of the proposed algorithm in terms of the distortion of the near-end speech and the residual echo, we carried out a set of informal listening tests. Eleven listeners (6 men and 5 women), whose ages ranged from 20 to 35, participated in the experiment. Eight of them were students specializing in signal processing, while the others were not specialists. Opinion scores were recorded from each listener, and the scores were then averaged to yield the final mean opinion score (MOS) results. Ten test phrases, five spoken by a male speaker and the other five by a female speaker, were used as the experimental data. Each phrase consisted of two different meaningful sentences and lasted 8 s, as suggested in [16]. Table 1 illustrates that the proposed approach outperformed, or at least was comparable to, the conventional methods in terms of overall subjective quality under the given noise conditions. In addition, we separately checked the noise reduction performance, which is one of the major goals of this work, using ITU-T P.835 [16], that is, a subjective quality test based on the background noise rating scale (5: not noticeable, 4: slightly noticeable, 3: noticeable but not intrusive, 2: somewhat intrusive, 1: very intrusive), conducted in a similar manner to the previous MOS test. As Table 2 shows, a performance improvement was found for all cases at all SNRs. These results confirm that the proposed integrated system is effective in suppressing the background noise.

Conclusions
In this paper, we have proposed a novel integrated suppression algorithm based on soft decision that uses the combined power of the estimated echo and noise. The principal contribution of this study is that the proposed method can efficiently suppress the acoustic echo and noise through a suppression gain based on soft decision, without an additional residual echo and noise suppressor. The performance of the proposed algorithm was found to be superior to that of the conventional techniques. Future work may consider other statistical models for the input signals, such as the Laplacian and gamma models in [17], even though the Gaussian model leads to more tractable mathematics.

Endnotes
a For [3], we set $T_n$ to 0.05, where $T_n$ denotes a minimum threshold.