Wavelet-Based Speech Enhancement Using Time-Frequency Adaptation

Wavelet denoising is commonly used for speech enhancement because it is simple to implement. However, conventional methods introduce musical residual noise when thresholding the background noise, and the unvoiced components of speech are often eliminated. In this paper, a novel wavelet coefficient threshold (WCT) algorithm based on time-frequency adaptation is proposed. In addition, an unvoiced speech enhancement algorithm is integrated into the system to improve the intelligibility of speech. The WCT of each subband is first temporally adjusted according to the value of the a posteriori signal-to-noise ratio (SNR). To prevent the degradation of unvoiced sounds during noise reduction, the algorithm uses a simple speech/noise detector (SND) and further divides the speech signal into unvoiced and voiced sounds. Appropriate wavelet thresholding is then applied according to the voiced/unvoiced (V/U) decision. Based on the masking properties of the human auditory system, a perceptual gain factor is incorporated into the wavelet thresholding to suppress musical residual noise. Simulation results show that the proposed method reduces noise with little speech degradation and that its overall performance is superior to several competitive methods.


Introduction
Many speech signal processing applications have been deployed in the real world [1]. The performance of speech coding and recognition systems that operate in noisy environments decreases when ambient noise levels are high. Speech enhancement has therefore become an active research topic for improving the performance of computer-based speech recognition, coding, and communication applications [2,3]. Existing methods such as spectral subtraction [4,5], Wiener filtering [5,6], and Ephraim-Malah filtering [7] are well known. More recently, wavelet shrinkage has emerged as a powerful tool for removing noise from signals [8][9][10][11]. It is a simple denoising technique based on thresholding the wavelet coefficients (WCs). Donoho and Johnstone first proposed a universal threshold for removing additive white Gaussian noise [8,9]. In addition, they proposed a level-dependent threshold for removing colored noise [12]. Bahoura and Rouat proposed adapting the threshold in the time domain by using the Teager energy operator (TEO) [13], which improves the discriminability of speech frames. Chen et al. presented an improved wavelet-based speech enhancement method using perceptual wavelet packet decomposition and the TEO. Lu and Wang proposed a method in which the background noise is almost completely removed by adjusting the wavelet coefficient threshold (WCT) according to the value of the SNR [14]. Since then, adaptive wavelet-based methods for speech enhancement have been widely reported; they adapt the WCT to improve enhancement performance.
For noisy speech, the energies of unvoiced segments are comparable to those of the noise. Most techniques that use wavelet thresholding for speech enhancement therefore suppress not only the additive noise but also some speech components, particularly unvoiced ones. Consequently, detecting the voiced and unvoiced segments of the speech signal is a central problem in wavelet-based methods. Sheikhzadeh and Abutalebi [15] suggested an improved scheme that categorizes each speech frame as either voiced or unvoiced. They increased the WCT for the high bands in a voiced frame and decreased it for the high bands in an unvoiced frame. As a result, both the low-frequency components of voiced segments and the high-frequency components of unvoiced segments are preserved by the soft thresholding algorithm. In addition, a number of methods have been considered for reducing the effect of musical residual noise [16,17]. Since the human ear cannot perceive additive noise at levels below the noise masking threshold (NMT), Virag used the masking properties of the human auditory system to suppress the effect of musical residual noise [16].
In this paper, we introduce a novel wavelet-based speech enhancement method that uses time-frequency adaptation to provide robustness to nonstationary and colored noise. The perceptual wavelet packet transform (PWPT) is applied to approximate the human auditory system. The wavelet coefficient threshold (WCT) of each subband is first temporally adjusted according to the value of the a posteriori signal-to-noise ratio (SNR). Then, using the V/U decision, different threshold values are applied to voiced and unvoiced frames to further improve the intelligibility of the processed speech signal. In addition, the musical residual noise is efficiently suppressed, improving the perceptual quality, by a gain factor derived from the NMT. Finally, an inverse PWPT is applied to resynthesize the enhanced speech.

Proposed Speech Enhancement Algorithm
Let s(n) represent a discrete-time speech signal, and let d(n) denote a discrete-time background noise signal. The noise-corrupted speech signal x(n) can be modeled as x(n) = s(n) + d(n). The architecture of the proposed speech enhancement method, based on the time-frequency adaptation of the wavelet threshold, is shown in Figure 1. The method is organized in the following seven steps.

Perceptual Wavelet Packet Transform (PWPT).
Critical subbands are widely used in perceptual auditory modeling [18]. In this work, a perceptual wavelet packet transform (PWPT) is used to decompose the speech signal from 20 Hz to 16 kHz into 24 critical frequency subbands: $w^{j}_{\xi}(k) = \mathrm{PWPT}\{x(n)\}$, where $w^{j}_{\xi}(k)$ is the kth coefficient of the ξth subband on level j and PWPT{·} denotes the PWPT operation. Figure 2 shows the implementation of an efficient five-level tree structure. At each level, the lowpass (LP) and highpass (HP) filters are applied before downsampling by 2; they are implemented as 18-tap FIR filters derived from the Daubechies wavelet family shown in Figure 3 [19].
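The filter-and-downsample structure described above can be sketched as follows. This is only an illustration: it swaps in the 2-tap Haar filter pair for the 18-tap Daubechies filters used in the paper, and it builds a full (uniform) packet tree rather than the non-uniform 24-band critical-band tree of Figure 2. The function names `haar_split` and `packet_decompose` are hypothetical.

```python
import math

def haar_split(x):
    """One analysis step: lowpass/highpass filtering, then downsampling by 2.

    Uses the orthonormal Haar pair as a stand-in for the paper's
    18-tap Daubechies filters (assumption made for brevity).
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(x) - 1, 2)
    approx = [s * (x[i] + x[i + 1]) for i in pairs]  # lowpass branch
    detail = [s * (x[i] - x[i + 1]) for i in pairs]  # highpass branch
    return approx, detail

def packet_decompose(x, depth):
    """Full wavelet packet tree: every band is split at every level.

    Returns 2**depth subbands in natural (tree) order, which is not
    strictly increasing frequency order.
    """
    bands = [list(x)]
    for _ in range(depth):
        next_bands = []
        for band in bands:
            a, d = haar_split(band)
            next_bands.extend([a, d])
        bands = next_bands
    return bands
```

A perceptual tree would split only selected nodes so that the resulting bandwidths follow the critical-band scale; the splitting rule itself is the one given by the tree in Figure 2.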

Speech/Noise Detector (SND) Using Teager Energy on Wavelet Domain.
EURASIP Journal on Advances in Signal Processing
Various techniques for detecting voiced/unvoiced (V/U) speech regions have been proposed; however, the performance of a speech/noise detector (SND) degrades dramatically in noise. The Teager energy operator (TEO) is a powerful nonlinear operator; it has been experimentally observed that the TEO can enhance the discriminability between speech and noise and further suppress the noise components of noisy speech signals [20]. The discrete-time TEO is applied to the wavelet coefficients $w^{j}_{\xi,m}(k)$: $t^{j}_{\xi,m}(k) = [w^{j}_{\xi,m}(k)]^{2} - w^{j}_{\xi,m}(k+1)\, w^{j}_{\xi,m}(k-1)$, where m represents the frame index. The simple SND algorithm computes the level-1 energy of the Teager coefficients $t^{j}_{\xi,m}(k)$. If the energy concentrated in the level-1 approximation is above 90% of the total energy, the current frame is regarded as a speech-dominated segment. Here $t^{1}_{1,m}(k)$ and $t^{1}_{2,m}(k)$ are the Teager coefficients of the approximation and detail subbands, respectively.
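A minimal sketch of this detector is given below. It assumes that "energy" means the sum of the absolute Teager coefficients in each level-1 subband; the exact energy measure is not reproduced in the text, so that choice, like the function names, is an assumption.

```python
def teager(w):
    """Discrete-time Teager energy: t(k) = w(k)^2 - w(k+1) * w(k-1).

    The first and last samples have no two-sided neighbours,
    so only interior samples are returned.
    """
    return [w[k] ** 2 - w[k + 1] * w[k - 1] for k in range(1, len(w) - 1)]

def is_speech_dominated(approx, detail, ratio=0.9):
    """Return True when the level-1 approximation carries more than
    `ratio` of the total Teager energy of the frame."""
    # Assumption: energy is measured as the sum of |t(k)|.
    e_approx = sum(abs(t) for t in teager(approx))
    e_detail = sum(abs(t) for t in teager(detail))
    total = e_approx + e_detail
    return total > 0 and e_approx / total > ratio
```

For example, a frame whose level-1 approximation is a strong low-frequency oscillation and whose detail band is small is flagged as speech-dominated, while swapping the two bands gives the opposite decision.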
To further separate unvoiced sounds from noise segments, an unvoiced decision is proposed in this section. According to the tree structure of the PWPT (shown in Figure 2), three subenergies corresponding to the wavelet subband signals are defined, and the mth frame is classified as unvoiced on the basis of these subenergies.

The Tracking of Subband Noise Power.
Since the background noise level varies with time, noise tracking plays a major role in determining the quality of a speech enhancement system, especially in nonstationary environments. The decision of the SND is used to update the subband noise power. The subband noise power, $\sigma_d^2(\xi, m)$, is then adaptively estimated following [21], where $\varepsilon(\xi, m)$ is the energy of the ξth critical subband, defined later. $\alpha_d(\xi, m)$ and $\alpha_p(\xi, m)$ are smoothing parameters. $p(\xi, m)$ and $I(\xi, m)$ are the conditional signal presence probability and an indicator of a voice-dominated frame, respectively. In (6), $I(\xi, m)$ controls the updating of the noise power. It depends on the speech-present ratio and is determined by the decision for speech-only frames. If SND(m) indicates voiced or unvoiced sound, then $I(\xi, m) = 1$; consequently $\alpha_d(\xi, m)$ increases and the noise power of the next frame stays close to the current estimated noise power. Conversely, if SND(m) indicates a noise period, then $I(\xi, m) = 0$; consequently $\alpha_d(\xi, m)$ decreases and the noise power of the next frame is updated mainly from the currently observed signal power.
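The update equation from [21] is not reproduced here, so the following is only a sketch of one common form of recursive averaging controlled by a speech-presence probability, chosen because it reproduces the behaviour described above. The exact rule in [21] may differ, and the default values `alpha_d=0.95` and `alpha_p=0.2` are hypothetical.

```python
def update_noise_power(noise, energy, p_prev, speech_present,
                       alpha_d=0.95, alpha_p=0.2):
    """One per-subband noise update step (assumed form, not taken from [21]).

    noise          -- current noise power estimate of the subband
    energy         -- observed subband energy of the current frame
    p_prev         -- speech presence probability of the previous frame
    speech_present -- SND decision (True for voiced or unvoiced frames)
    """
    indicator = 1.0 if speech_present else 0.0
    # Smooth the hard SND decision into a presence probability.
    p = alpha_p * p_prev + (1.0 - alpha_p) * indicator
    # Effective smoothing factor: close to 1 during speech, alpha_d in noise.
    alpha_eff = alpha_d + (1.0 - alpha_d) * p
    new_noise = alpha_eff * noise + (1.0 - alpha_eff) * energy
    return new_noise, p
```

With speech present the effective smoothing factor approaches 1 and the estimate is held; during noise it falls back to `alpha_d` and the estimate moves toward the observed energy.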
The result of noise tracking is used to calculate the a posteriori signal-to-noise ratio (SNR): $\mathrm{SNR}_{\mathrm{post}}(\xi, m) = \varepsilon(\xi, m) / \sigma_d^2(\xi, m-1)$, where $\sigma_d^2(\xi, m-1)$ is the estimated noise power of the previous frame. The value of $\mathrm{SNR}_{\mathrm{post}}(\xi, m)$ is thus the ratio of the observed wavelet energy of the ξth subband to the noise power estimated for that subband in the previous frame. It indicates how strongly the current subband is corrupted by noise, and this information is used for denoising. During the initialization period, the observed power is assumed to be noise only, and the noise spectrum is estimated by averaging the initial 10 frames.
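The initialization and the a posteriori SNR can be sketched as follows, assuming the subband energies are available as one list per frame. The function names and the small `floor` guard against division by zero are assumptions added for the example.

```python
def initial_noise_power(frame_energies, n_init=10):
    """Average the per-subband energy over the first n_init frames,
    which are assumed to contain noise only."""
    frames = frame_energies[:n_init]
    n_bands = len(frames[0])
    return [sum(f[b] for f in frames) / len(frames) for b in range(n_bands)]

def posterior_snr(energy, prev_noise, floor=1e-12):
    """Per-subband ratio of observed energy to the previous noise estimate."""
    return [e / max(n, floor) for e, n in zip(energy, prev_noise)]
```

A subband whose energy is ten times the stored noise power yields an a posteriori SNR of 10 (linear scale), while a subband at the noise level yields 1.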

Estimation of Noise Masking Threshold (NMT).
This subsection describes the incorporation of human auditory masking properties into the enhancement system. The NMT is estimated from the WCs of the PWPT. First, the WCs are obtained from the PWPT of the noisy speech. The energy of the ξth critical subband is calculated as $\varepsilon(\xi, m) = \sum_{k=l(\xi)}^{h(\xi)} |w^{j}_{\xi,m}(k)|^{2}$, where $l(\xi)$ and $h(\xi)$ are the indices of the first and last wavelet coefficients in the ξth critical subband [16]. An excitation pattern $B(\xi, m)$ can be regarded as an energy distribution along the basilar membrane. It is calculated by convolving the subband energy $\varepsilon(\xi, m)$ with the spreading function $F(\xi)$ given in [16,22]: $B(\xi, m) = F(\xi) * \varepsilon(\xi, m)$. A relative threshold offset $O(\xi)$, which can be found in [12,16], specifies whether a speech frame is tone-like or noise-like. This offset, whose values are all negative, is imposed when adjusting the log subband energy. Convolving the subband energy $\varepsilon(\xi, m)$ with the spreading function $F(\xi)$ increases the energy in each subband, so each $B(\xi, m)$ must be multiplied by the inverse of the energy gain for renormalization. Accordingly, a normalized threshold $Th(\xi, m)$ is obtained using $G(\xi, m)$, the gain factor in dB between the spread energy $B(\xi, m)$ and the subband energy $\varepsilon(\xi, m)$.

$G(\xi, m)$ is expressed as $G(\xi, m) = 10 \log_{10} \big( B(\xi, m) / \varepsilon(\xi, m) \big)$.
Additionally, the normalized threshold $Th(\xi, m)$ is compared with the absolute hearing threshold (AHT), which is frequency dependent and can be closely approximated as [16,22] $\mathrm{AHT}(f) = 3.64 f^{-0.8} - 6.5 e^{-0.6 (f - 3.3)^{2}} + 10^{-3} f^{4}$ (dB SPL) (13), with f in kHz.

Finally, the NMT $T(\xi, m)$ is obtained by $T(\xi, m) = \max\{ Th(\xi, m), \mathrm{AHT}(f) \}$, where f is chosen as the central frequency of the critical band ξ.
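The last steps of the NMT estimate can be sketched as below. The sketch assumes the widely used closed-form approximation of the absolute hearing threshold in dB SPL, that the spread-to-subband gain is a plain dB ratio, and that "compared with" means taking the larger of the two thresholds. The spreading function and the offset O(ξ) are not modelled, and all function names are hypothetical.

```python
import math

def gain_db(spread_energy, subband_energy):
    """Gain in dB between the spread energy B and the subband energy."""
    return 10.0 * math.log10(spread_energy / subband_energy)

def absolute_hearing_threshold_db(f_khz):
    """Approximate threshold in quiet (dB SPL) at frequency f_khz (kHz)."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

def noise_masking_threshold_db(normalized_threshold_db, center_khz):
    """Keep the larger of the normalized masking threshold and the AHT."""
    return max(normalized_threshold_db,
               absolute_hearing_threshold_db(center_khz))
```

The approximation dips below 0 dB SPL around 3 to 4 kHz, where hearing is most sensitive, so in those bands the masking threshold usually dominates.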

Estimation of Wavelet Coefficient Threshold.
In this work, we propose a novel scheme that adjusts the WCT according to the value of the a posteriori SNR and formulate the WCT as in (15), where $\lambda_j = (\mathrm{MAD}_j / 0.6745) \sqrt{2 \log N_j}$ is the level-dependent threshold [12] and $\mathrm{MAD}_j$ denotes the median absolute deviation estimated at the jth level. γ and η are the slope and center offset of the sigmoid function and are chosen to be 0.2 and 1, respectively. In (15), the value of $\lambda(\xi, m)$ is adjusted by a sigmoid function and varies with the estimated a posteriori SNR in nonspeech segments; otherwise, the smoothing parameter is set to one. Increasing γ narrows the transition range with respect to the a posteriori subband SNR, whereas decreasing it widens the transition range.
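The level-dependent threshold can be computed as sketched below. The second function is only a hypothetical illustration of sigmoid-based adaptation: equation (15) is not reproduced in the text, so the decreasing logistic weight used here is an assumption that merely follows the stated intent (a smaller threshold at high a posteriori SNR), with γ = 0.2 and η = 1 taken from the text.

```python
import math
import statistics

def level_threshold(coeffs):
    """Level-dependent threshold: (MAD / 0.6745) * sqrt(2 * ln(N)).

    Assumption: detail coefficients have zero median, so the MAD is
    taken as the median of their absolute values.
    """
    mad = statistics.median([abs(c) for c in coeffs])
    return mad / 0.6745 * math.sqrt(2.0 * math.log(len(coeffs)))

def adapted_threshold(lam, snr_post, gamma=0.2, eta=1.0):
    """Hypothetical sigmoid adaptation, NOT the paper's equation (15).

    The weight falls from 1 toward 0 as the a posteriori SNR rises,
    so speech-dominated subbands are thresholded less aggressively.
    """
    weight = 1.0 / (1.0 + math.exp(gamma * (snr_post - eta)))
    return lam * weight
```

With this form the weight equals 0.5 when the a posteriori SNR equals η, and a larger γ makes the transition around η sharper.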
In general, a frame with a high signal-to-noise ratio (SNR) is a speech-dominated frame. Conversely, a frame with a low SNR is either in a noise-only region or in a very noisy environment. The wavelet threshold should therefore be made smaller for a speech-dominated frame, whereas in a noise-dominated frame the wavelet coefficients are contributed mostly by the noise component and a larger threshold is appropriate.
Speech-dominated frames can be further categorized into two types, voiced speech and unvoiced speech, according to the V/U decision. A voiced frame possesses a strong tone-like spectrum in the lower subbands, so the WCs at lower frequencies must be preserved. Conversely, the WCT tends to increase at lower frequencies if the frame is categorized as unvoiced speech. Voiced sounds are quasiperiodic in the time domain and harmonically structured. In the frequency domain, these sounds are generally localized in bands below 1 kHz. For many vowels of male and female voices, statistical results indicate that the frequency of the first formant lies approximately between 100 Hz and 1 kHz. Consequently, when the V/U decision indicates a voiced-dominated frame, the WCT from (15) is adapted according to (16), where $\alpha_L = 0.1$ and $\alpha_H = 1.0$ are experimentally determined. The frequency boundary covers most of the tone-like frequency components, and $f_\xi$ denotes the frequency bin of subband ξ.
In (16), more WCs in the lower subbands are preserved because a voiced frame contains strong tone-like components at lower frequencies. This is accomplished by reducing the WCT in the lower wavelet subbands.
In contrast, the energy of unvoiced sounds is usually concentrated at high frequencies (≥3 kHz). When the V/U decision indicates an unvoiced-dominated frame, the WCT from (15) is adjusted according to (17), where $\beta_L = 1.2$ and $\beta_H = 0.05$ are experimentally determined.
In a voiced frame, the higher subbands contain less voiced information, so reducing the WCs in the higher subbands suppresses background noise. In an unvoiced frame, the higher subbands contain more significant information than the lower subbands do. Hence, preserving the WCs of the higher subbands, by reducing the WCT in the higher wavelet subbands as in (17), achieves better performance, while the WCs of the lower subbands must be reduced to suppress the background noise.
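The voiced and unvoiced adaptations can be illustrated with a simple piecewise scaling of the threshold. Equations (16) and (17) are not reproduced in the text, so the multiplicative form and the exact placement of the 1 kHz and 3 kHz boundaries are assumptions. Only the constants α_L = 0.1, α_H = 1.0, β_L = 1.2, and β_H = 0.05 come from the text, and the function name is hypothetical.

```python
ALPHA_L, ALPHA_H = 0.1, 1.0   # voiced frame: low-band / high-band factors
BETA_L, BETA_H = 1.2, 0.05    # unvoiced frame: low-band / high-band factors

def vu_adapted_threshold(lam, f_hz, decision):
    """Scale the threshold `lam` of a subband centred at f_hz.

    Assumed form: a multiplicative factor that depends on the V/U
    decision and on which side of the frequency boundary the band lies.
    """
    if decision == "voiced":
        # Keep tone-like content below 1 kHz by lowering its threshold.
        return lam * (ALPHA_L if f_hz <= 1000.0 else ALPHA_H)
    if decision == "unvoiced":
        # Keep high-frequency content (>= 3 kHz), suppress the rest more.
        return lam * (BETA_H if f_hz >= 3000.0 else BETA_L)
    return lam  # noise frame: threshold left unchanged
```

For a voiced frame the 500 Hz band is thresholded at one tenth of the base value, while for an unvoiced frame the same band is thresholded 20% harder and a 3.5 kHz band is almost left untouched.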
From (18), if the energy of the musical residual noise, $\sigma_d^2(\xi, m)$, is greater than the NMT in a subband, the wavelet coefficient thresholds are adjusted by the gain factor and become smaller, in order to suppress the corrupting noise. If the energy of the residual noise is smaller than the NMT, however, the corrupting noise cannot be perceived by the human ear, so the WCTs need not be changed and the speech quality is retained.

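Soft Thresholding.
The noise components are suppressed by soft thresholding the wavelet packet coefficients of the noisy signal as follows: $\hat{w}^{j}_{\xi,m}(k) = \begin{cases} \mathrm{sgn}[w^{j}_{\xi,m}(k)] \, \big( |w^{j}_{\xi,m}(k)| - \lambda(\xi, m) \big), & |w^{j}_{\xi,m}(k)| > \lambda(\xi, m), \\ 0, & \text{otherwise}, \end{cases}$ where sgn[·] is the sign function and $\hat{w}^{j}_{\xi,m}(k)$ is the thresholded wavelet coefficient.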
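Soft thresholding itself is the standard shrinkage rule: coefficients whose magnitude does not exceed the threshold are set to zero, and the remaining ones are pulled toward zero by the threshold. A minimal sketch follows, with a hypothetical function name and a single threshold per call.

```python
import math

def soft_threshold(coeffs, lam):
    """Shrink each coefficient toward zero by lam; zero it if |w| <= lam."""
    out = []
    for w in coeffs:
        shrunk = abs(w) - lam
        # copysign restores the original sign of the coefficient.
        out.append(math.copysign(shrunk, w) if shrunk > 0 else 0.0)
    return out
```

In the proposed system the threshold passed in would be the adapted value for the subband and frame being processed.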

Inverse PWPT.
Finally, the speech signal is synthesized by the inverse transformation of the thresholded wavelet packet coefficients as follows: $\hat{s}(n) = \mathrm{PWPT}^{-1}\{ \hat{w}^{j}_{\xi,m}(k) \}$, where $\mathrm{PWPT}^{-1}\{\cdot\}$ denotes the inverse PWPT.

Experimental Results
In this section, we use a speech database that contains 60 speech phrases (in Mandarin Chinese and in English) spoken by both male and female speakers. To construct the noisy test signals, prepared noise signals were added to the recorded speech at SNRs ranging from −5 dB to 10 dB. A variety of nonstationary noises were taken from the Noisex-92 database [24]. All noisy signals are sampled at 8 kHz with 16 bits/sample. The frame size is 64 milliseconds and the frame shift is 32 milliseconds. To evaluate the performance of our algorithm, the proposed method is compared with (1) speech enhancement using a time-invariant threshold (TI) [8], (2) speech enhancement using perceptual wavelet packet decomposition and the Teager energy operator (WPD+TEO) [10], (3) wavelet speech enhancement based on time-scale adaptation (TSA) [25], and (4) speech enhancement using perceptually constrained gain factors in the critical-band WPT (PER+WPT) [14].
Several objective speech quality measures, including segmental SNR (SegSNR) improvement, the Itakura-Saito (IS) measure [26], and the perceptual evaluation of speech quality (PESQ) [27][28][29], are evaluated for input SNRs in the range [−5, 10] dB. Table 1 shows the SegSNR improvements obtained by the different methods in different noise environments. The SegSNR improvement reflects the amount of noise reduction, residual noise, and speech distortion. The higher SegSNR improvements show that the proposed method achieves better enhancement performance than the other methods. In addition, the perceptual gain factor contributes most at the lower input SNRs in the proposed method. Table 2 shows the Itakura-Saito (IS) measure results. The majority of the IS results show that the proposed method has lower spectral distortion than the other methods at different SNR levels for various nonstationary noises such as factory and vehicle noise. The PESQ scores, which are designed to predict the ratings of human listeners, are presented in Table 3 for all algorithms. In Table 3, the values in brackets are the PESQ scores without the thresholding process. The proposed enhancement method yields a larger improvement in speech quality than the other methods, especially at low SNR.
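For reference, SegSNR improvement can be computed along the following lines. This is one common definition, not necessarily the exact variant used for Table 1: non-overlapping frames, a per-frame SNR in dB clamped to [−10, 35] dB, and a plain average over frames. The clamping limits, the `eps` guard, and the function names are assumptions.

```python
import math

def segmental_snr(clean, processed, frame_len, lo=-10.0, hi=35.0, eps=1e-12):
    """Mean per-frame SNR in dB between a clean reference and a processed
    signal; assumes the signals contain at least one full frame."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        y = processed[start:start + frame_len]
        signal = sum(a * a for a in s)
        error = sum((a - b) ** 2 for a, b in zip(s, y))
        snr = 10.0 * math.log10((signal + eps) / (error + eps))
        values.append(min(max(snr, lo), hi))  # limit outlier frames
    return sum(values) / len(values)

def segsnr_improvement(clean, noisy, enhanced, frame_len):
    """SegSNR of the enhanced signal minus SegSNR of the noisy input."""
    return (segmental_snr(clean, enhanced, frame_len)
            - segmental_snr(clean, noisy, frame_len))
```

At the 8 kHz sampling rate and 64 ms frame size stated above, `frame_len` would be 512 samples if the evaluation used the same framing as the enhancement.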

Conclusion
The proposed speech enhancement algorithm uses a time-frequency adaptive wavelet threshold instead of the traditional time-invariant and time-variant thresholds. The wavelet coefficient threshold is adjusted according to the value of the a posteriori SNR. In addition, the V/U decision allows the WCT to differ between voiced and unvoiced frames. Residual musical noise is successfully suppressed when a perceptual gain factor is incorporated into the estimation of the WCT. Experimental results show that the proposed method yields a higher improvement in SegSNR, a lower IS measure, and higher PESQ scores than other methods under all tested environmental conditions.