5.1. The Initial Algorithm: Martin's Method
Noise spectrum estimation is one of the most challenging tasks in spectral subtraction speech enhancement algorithms. For stationary noise, the first 100–200 ms of each noisy signal are usually assumed to be pure noise and are used to estimate the noise for the whole signal [31]. For nonstationary noise, the noise spectrum needs to be estimated and updated continuously. This normally requires a voice activity detector (VAD) to find silence frames in which the noise estimate is updated [32]. With nonstationary noise or at low SNR, the reliability of nonspeech/pause detection is a concern. In [18], the author proposes an algorithm that does not require explicit speech/pause detection and can update the noise estimate even during noisy speech sections. This minimum statistics noise tracking method is based on the observation that, even during speech activity, a short-term power spectral density (PSD) estimate of the noisy signal frequently decays to values that are representative of the noise power level. Thus, by tracking the minimum power within a finite window of D PSD frames, large enough to bridge high-power speech segments, the noise floor can be estimated [33].
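As an illustration of the stationary-noise case described above, the following minimal sketch averages the periodograms of the leading frames that fall inside the assumed noise-only interval. The function name, the millisecond arguments, and the 150 ms default are assumptions chosen for illustration; the periodograms are taken as already computed.

```python
def initial_noise_estimate(periodograms, hop_ms, init_ms=150):
    """Average the periodograms of the leading, assumed noise-only frames.

    periodograms: list of frames, each a list of |Y|^2 values per frequency bin.
    hop_ms: frame advance in milliseconds (integer).
    init_ms: length of the assumed noise-only lead-in (hypothetical default
             inside the 100-200 ms range mentioned in the text).
    """
    if not periodograms:
        raise ValueError("at least one frame is required")
    n_frames = max(1, init_ms // hop_ms)   # frames inside the lead-in
    head = periodograms[:n_frames]
    n_bins = len(head[0])
    # Per-bin mean over the selected frames.
    return [sum(frame[k] for frame in head) / len(head) for k in range(n_bins)]


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    print(initial_noise_estimate(frames, hop_ms=75))
```

This estimate stays fixed for the whole signal, which is why it only suits stationary noise.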
The smoothed power spectrum of noisy speech, $P(\lambda, k)$, is calculated with a first-order recursive equation as follows:
$$P(\lambda, k) = \eta\, P(\lambda - 1, k) + (1 - \eta)\, |Y(\lambda, k)|^2,$$
where $\lambda$ and $k$ are the frame and the frequency bin indices, respectively, and $|Y(\lambda, k)|^2$ is the periodogram of the noisy speech. η is a smoothing constant whose value is set between zero and one. A constant value of 0.85 to 0.95 is often suggested [33].
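The first-order recursion can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the initialisation with the first periodogram, and the default η = 0.9 are assumptions.

```python
def smooth_psd(periodograms, eta=0.9):
    """First-order recursive smoothing of per-frame periodograms.

    periodograms: list of frames, each a list of |Y(frame, bin)|^2 values.
    eta: smoothing constant, strictly between 0 and 1 (0.9 is an assumed
         value inside the 0.85-0.95 range quoted in the text).
    """
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    smoothed = []
    previous = None
    for frame in periodograms:
        if previous is None:
            # Assumption: start the recursion from the first periodogram.
            current = list(frame)
        else:
            current = [eta * p + (1.0 - eta) * y
                       for p, y in zip(previous, frame)]
        smoothed.append(current)
        previous = current
    return smoothed


if __name__ == "__main__":
    print(smooth_psd([[1.0], [0.0], [0.0]], eta=0.9))
```

A larger η gives a smoother estimate with a slower response to changes in the signal power.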
If the signal can be assumed stationary with a relatively small span of correlation, then for a large frame size the real and imaginary parts of the Fourier transform coefficients, $Y(\lambda, k)$, can be considered independent and modeled as zero-mean Gaussian random variables [34]. Under this assumption, each periodogram bin is an exponentially distributed random variable. If the condition holds, an optimal smoothing constant derived in [33] can be employed to enhance the performance:
$$\eta_{\mathrm{opt}}(\lambda, k) = \frac{1}{1 + \left( P(\lambda - 1, k) / \sigma_N^2(\lambda, k) - 1 \right)^2},$$
where $P(\lambda - 1, k)$ is the smoothed PSD of the previous frame and $\sigma_N^2(\lambda, k)$, the true PSD of the noise, can be replaced by its latest estimate, $\hat{\sigma}_N^2(\lambda - 1, k)$. More work on this subject has recently been reported in [35]. The dependency of the optimal value of η on $P(\lambda - 1, k)$, $\sigma_N^2(\lambda, k)$, and the noise probability density function (PDF) increases the computational burden, while the allowable range of η (0.85 to 0.95) is narrow and the PDF of the (nonstationary) noise is uncertain. This justifies using an average value that is recalculated occasionally, instead of computing the exact but nonoptimal value in each iteration.
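The idea of an occasionally recomputed average smoothing constant can be sketched as below. The per-bin formula is the form usually given for the optimal smoothing parameter in the minimum statistics literature, 1 / (1 + (P/σ² − 1)²), and is used here as an assumption. The clamping to [0.85, 0.95] and all function names are also illustrative choices.

```python
def optimal_eta(prev_smoothed, noise_psd, lower=0.85, upper=0.95):
    """Per-bin smoothing constant, clamped to the allowable range.

    prev_smoothed: smoothed PSD of the previous frame for this bin.
    noise_psd: latest noise PSD estimate for this bin (must be positive).
    lower, upper: assumed clamp limits taken from the range in the text.
    """
    if noise_psd <= 0.0:
        raise ValueError("noise_psd must be positive")
    ratio = prev_smoothed / noise_psd
    eta = 1.0 / (1.0 + (ratio - 1.0) ** 2)   # assumed formula, see lead-in
    return min(upper, max(lower, eta))


def average_eta(prev_frame, noise_frame):
    """Mean of the clamped per-bin constants, to be recomputed occasionally."""
    values = [optimal_eta(p, n) for p, n in zip(prev_frame, noise_frame)]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(average_eta([1.0, 3.0], [1.0, 1.0]))
```

Because the clamp keeps every per-bin value inside a narrow interval, the average differs little from the exact values, which is the trade-off the paragraph above argues for.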
5.2. Noise Spectral Minimum Estimation
Since the spectrum of the noisy speech signal often decays to the spectrum of the noise, an estimate of the noise can be obtained within a time window of about 0.8–1.4 s. This corresponds to finding the minimum among a number (D) of consecutive PSDs,
, as follows:
where i is the estimation iteration number. The calculated spectral minimum is then used in the future frames (
) for spectral subtraction. The equation may be evaluated at every
step,
, in which case
compare operations are needed per step. However, if it is computed only after every
consecutive PSDs,
, the number of compare operations drops to about
operation per
step. In any case, if the current noisy speech power spectrum is smaller than
, the noise power is updated immediately:
However, if the noise power increases in the current frame, the update of the noise estimate is delayed by more than D spectral frames.
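The two update strategies and the immediate downward update can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions: the function names and the start-up behaviour (beginning from the first value) are not taken from the source.

```python
import math
from collections import deque


def sliding_minimum(smoothed_bin, D):
    """Minimum over the last D smoothed PSD values, re-evaluated every frame."""
    window = deque(maxlen=D)          # keeps only the D most recent values
    minima = []
    for value in smoothed_bin:
        window.append(value)
        minima.append(min(window))    # up to D - 1 comparisons per frame
    return minima


def blockwise_minimum(smoothed_bin, D):
    """Minimum renewed once per block of D frames, about one compare per frame.

    A value below the current estimate lowers the estimate immediately.
    """
    if not smoothed_bin:
        return []
    estimate = smoothed_bin[0]        # assumed start-up value
    running = math.inf                # minimum of the block being filled
    count = 0
    out = []
    for value in smoothed_bin:
        running = min(running, value)
        count += 1
        if value < estimate:
            estimate = value          # immediate downward update
        if count == D:
            estimate = running        # block complete: adopt its minimum
            running = math.inf
            count = 0
        out.append(estimate)
    return out


if __name__ == "__main__":
    rising = [5.0, 3.0, 4.0, 6.0, 7.0, 8.0]
    print(sliding_minimum(rising, 3))
    print(blockwise_minimum(rising, 3))
```

With the rising sequence in the usage example, the blockwise estimate follows a drop at once but reacts to the increase only when the next block of D frames is complete, which is the delay noted above.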
The estimate of
suffers from a bias toward lower values that has to be compensated.
In case of a relatively white
, bias compensation equations have been derived in [18, 33], with the one in [33] being as follows:
where
indicates the time of the previous
estimation. The equation indicates that the compensation constant is a function of time,
and frequency bin,
. However, its exact value will not be optimal for nonstationary situations. Occasionally deriving an average value and using it instead is a remedy that avoids the computational cost and is consistent with the nonoptimality of the exact value.
The incorporation of the temporal characteristics of angle grinder noise into the algorithm is elaborated in Section 5.3, while the use of the frequency characteristics of the noise power is addressed in Section 5.4.
5.3. Fast Adapting Noise Estimation
To compensate for the noise estimation delay when the noise power jumps, the D-PSD block is divided into C weighted M-PSD subblocks ($D = C \cdot M$). This reduces the computational complexity and makes the adaptation faster [18]. The decomposition of the D-PSD block into C subblocks has the advantage that a new minimum estimate is available after only M samples, without a substantial increase in operations.
The computation starts with the calculation of the spectral minimum of the first M frames as follows:
Then,
for each of the other next
frames is determined. After the calculation of a set of
number of
, the next D-PSD spectral minimum is derived as follows:
D must be large enough to bridge any peak of speech activity, but short enough to follow nonstationary noise variations. Experiments with different speakers and modulated noise signals have shown that window lengths of approximately 0.8 s–1.4 s give good results [18].
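The subblock scheme can be sketched for a single frequency bin as below. It is a minimal, unweighted illustration: the weighting of subblocks is not reproduced, and the function names, the integer millisecond arguments, and the start-up behaviour are assumptions.

```python
import math
from collections import deque


def frames_in_window(window_ms, hop_ms):
    """Number of frames D covering a window, rounded up (integer ms inputs)."""
    return -(-window_ms // hop_ms)    # ceiling division on integers


def subblock_minimum(smoothed_bin, C, M):
    """Track the minimum over D = C * M frames using C subblocks of M frames.

    A refreshed estimate becomes available after every M frames, as the
    minimum of the last C stored subblock minima.
    """
    sub_minima = deque(maxlen=C)      # minima of the last C completed subblocks
    running = math.inf                # minimum of the subblock being filled
    count = 0
    estimate = None
    out = []
    for value in smoothed_bin:
        running = min(running, value)
        count += 1
        if count == M:
            sub_minima.append(running)
            estimate = min(sub_minima)
            running = math.inf
            count = 0
        if estimate is None or value < estimate:
            estimate = value          # start-up and immediate downward update
        out.append(estimate)
    return out


if __name__ == "__main__":
    # A 16 ms frame advance is a hypothetical value for this example.
    print(frames_in_window(800, 16), frames_in_window(1400, 16))
    print(subblock_minimum([5.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0, 9.0], C=2, M=2))
```

Only C stored minima are compared when a subblock completes, so the estimate is refreshed every M frames at a small cost.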
Now, in case of increasing noise power in the current frame, the update of the noise estimate is delayed by
spectral frames. To speed up the tracking of the noise spectral minimum, an increase in the importance of the current subframe with respect to the past subframes is proposed:
where
is a look-ahead constant with
. In the simplest case we have
. Also, to obtain an accurate estimate of the noise spectral minimum when a jump occurs in the noise power, we modify (12) as follows:
where
is the relation-ahead parameter that is related to the segmental NSNR and
. In the simplest situation we set
. By increasing the values of
and
, the algorithm tracks nonstationary noise better, while the upper bound prevents speech distortion. These provisions are closely tied to the temporal characteristics of the noise spectrum. For an angle grinder, the change in working conditions from nonengaged (stationary noise), to the start of engagement (a jump in noise power), to engaged with the part (nonstationary noise), and vice versa, shapes the time dependency of the spectrum.
5.4. Multiband Fast Adapting Noise Spectral Estimation
In the case of angle grinder noise, the segmental SNR of the high-frequency band is significantly lower than that of the low-frequency band, which implies that their noise variances differ. Another important point is that the high-energy first formant of vowels lies approximately in the frequency band between 400 and 1000 Hz. As a result, this band is less susceptible to a coarse estimate of the noise spectrum. In the upper frequency band that consonants occupy, on the other hand, the noise spectral estimate should be as precise as possible; otherwise, the intelligibility of speech is impaired. For these reasons, to enhance the performance of our algorithm, we divide the overall spectrum into four regions (0–400 Hz, 400–600 Hz, 600–1000 Hz, and above 1000 Hz), and in compliance with (14), separate values for
and
are assigned to each of them. This is somewhat similar to the study in [36] regarding colored noise. With this technique, different sensitivities in tracking nonstationary noise are employed in the different frequency bands. Hence, a reduction in speech distortion and an increase in the SNR of the processed speech are expected. For good performance, lower values for
and
in the lower bands are suggested.
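The band split can be sketched as a lookup from frequency bin to per-band parameters. The band edges come from the text; the parameter names `look_ahead` and `relation_ahead`, their numeric values, the sampling rate, and the FFT size are hypothetical and chosen only so that lower bands receive lower values.

```python
from bisect import bisect_right

BAND_EDGES_HZ = (400.0, 600.0, 1000.0)   # edges of the four regions in the text

# Hypothetical per-band values, increasing from the lowest to the highest band.
BAND_PARAMETERS = (
    {"look_ahead": 1.0, "relation_ahead": 1.0},   # 0-400 Hz
    {"look_ahead": 1.1, "relation_ahead": 1.1},   # 400-600 Hz
    {"look_ahead": 1.2, "relation_ahead": 1.2},   # 600-1000 Hz
    {"look_ahead": 1.5, "relation_ahead": 1.5},   # above 1000 Hz
)


def band_index(freq_hz):
    """Index of the region containing freq_hz; an edge belongs to the upper band."""
    return bisect_right(BAND_EDGES_HZ, freq_hz)


def per_bin_parameters(n_fft, sample_rate):
    """Parameter set for every bin from 0 Hz up to the Nyquist frequency."""
    n_bins = n_fft // 2 + 1
    return [BAND_PARAMETERS[band_index(k * sample_rate / n_fft)]
            for k in range(n_bins)]


if __name__ == "__main__":
    params = per_bin_parameters(n_fft=256, sample_rate=8000)
    print(len(params), params[12]["look_ahead"], params[40]["look_ahead"])
```

Building the table once and indexing it by bin keeps the per-frame cost of the multiband variant the same as that of the single-band tracker.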