A Noise Reduction Preprocessor for Mobile Voice Communication

We describe a speech enhancement algorithm which leads to significant quality and intelligibility improvements when used as a preprocessor to a low bit rate speech coder. This algorithm was developed in conjunction with the mixed excitation linear prediction (MELP) coder which, by itself, is highly susceptible to environmental noise. The paper presents novel as well as known speech and noise estimation techniques and combines them into a highly effective speech enhancement system. The algorithm is based on short-time spectral amplitude estimation, soft-decision gain modification, tracking of the a priori probability of speech absence, and minimum statistics noise power estimation. Special emphasis is placed on enhancing the performance of the preprocessor in nonstationary noise environments.


INTRODUCTION
With the advent and wide dissemination of mobile voice communication systems, telephone conversations are increasingly disturbed by environmental noise. This is especially true in hands-free environments where the microphone is far away from the speech source. As a result, the quality and intelligibility of the transmitted speech can be significantly degraded and fail to meet the expectations of mobile phone users. The environmental noise problem becomes even more pronounced when low bit rate coders are used in harsh acoustic environments. An example is the mixed excitation linear prediction (MELP) coder which operates at bit rates of 1.2 and 2.4 kbps. It is used for secure governmental communications and has been selected as the future NATO narrow-band voice coder [1]. In contrast to waveform approximating coders, low bit rate coders transmit parameters of a speech production model instead of the quantized acoustic waveform itself. Thus, low bit rate coders are more susceptible to a mismatch of the input signal and the underlying signal model.
It is well known that single microphone speech enhancement algorithms improve the quality of noisy speech when the noise is fairly stationary. However, they typically do not improve the intelligibility when the enhanced signal is presented directly to a human listener. The loss of intelligibility is mostly a result of the distortions introduced into the speech signal by the noise reduction preprocessor. However, the picture changes when the enhanced speech signal is processed by a low bit rate speech coder as shown in Figure 1. In this case, a speech enhancement preprocessor can significantly improve quality as well as intelligibility [2]. Therefore, the noise reduction preprocessor should be an integral component of the low bit rate speech communication system. Although many speech enhancement algorithms have been developed over the last two decades, such as Wiener and power-subtraction methods [3], maximum likelihood (ML) [4], minimum mean squared error (MMSE) [5,6], and others [7,8], improvements are still sought. In particular, since mobile voice communication systems frequently operate in nonstationary noise environments, such as inside moving vehicles, effective suppression of nonstationary noise is of vital importance. While most existing enhancement algorithms assume that the spectral characteristics of the noise change very slowly compared to those of the speech, this may not be true when communicating from a moving vehicle. Under such circumstances the noise may change appreciably during speech activity, and so confining the noise spectrum updates to periods of speech absence may adversely affect the performance of the speech enhancement algorithm. To maximize enhancement performance, the noise characteristics should be tracked even during speech. Most common enhancement techniques, including those cited above, operate in the frequency domain. 
These techniques apply a frequency-dependent gain function to the spectral components of the noisy signal, in an attempt to attenuate the noisier components to a greater degree. The gains applied are typically nonlinear functions of estimated signal and noise powers at each frequency. These functions are usually derived by either estimating the clean speech (e.g., the Wiener approach) or its spectral magnitude according to a specific optimization criterion (e.g., ML, MMSE). The noise-suppression properties of these enhancement algorithms have been shown to improve when a soft-decision modification of the gain function, which takes speech-presence uncertainty into account, is introduced [4,5,7,9]. To implement such a gain modification function, one must provide a value for the a priori probability of speech absence for each spectral component of the noisy signal. Therefore, we use the algorithm in [9] to estimate the a priori probability of speech absence as a function of frequency, on a frame-by-frame basis.
The objective of this paper is to describe a single microphone speech enhancement preprocessor which has been developed for voice communication in nonstationary noise environments with high quality and intelligibility requirements. Recently, this preprocessor has been proposed as an optional part of the future NATO narrow-band voice coder standard (also known as the MELPe coder [1]) and, in a slightly modified form, in conjunction with one of the ITU-T 4 kbps coder proposals [10]. The improvements we obtain with this system result from a synergy of several carefully designed system components. Significant contributions to the overall performance stem from a novel procedure for estimating the a priori probability of speech absence, and from a noise power spectral density (PSD) estimation algorithm with small error variance and good tracking properties.
A block diagram of the algorithm is shown in Figure 2. Spectral analysis consists of applying a window and the DFT. Spectral synthesis inverts the analysis with the IDFT and overlap-adding consecutive frames. The algorithm includes an MMSE estimator for the spectral amplitudes, a procedure for estimating the noise PSD, the long-term signal-to-noise ratio (SNR), and the a priori SNR, as well as a mechanism for the tracking of the a priori probability of speech absence. The spectral estimation procedure attenuates frequency components which contain primarily noise and passes those which contain mostly speech. As a result, the overall SNR of the processed speech signal is improved.
In the remainder of this paper we describe this algorithm in detail and evaluate its performance. In Section 2 we discuss windows for DFT-based spectral analysis and synthesis as well as the algorithmic delay of the joint enhancement and coding system. Sections 3, 4, and 5 present estimation procedures for the spectral coefficients and the long-term SNR. We outline the noise estimation algorithm [11] in Section 6, and summarize listening test results in Section 7. Section 8 concludes the paper. We reiterate that some components have been previously published [6,9,11,12]. Our goal here is to tie all required components together, thereby providing a comprehensive description of the MELPe enhancement system.

SPECTRAL ANALYSIS AND SYNTHESIS
Assuming an additive, independent noise model, the noisy signal y(n) is given by x(n) + d(n), where x(n) denotes the clean speech signal and d(n) the noise. All signals are sampled at a sampling rate of f_s. We apply a short-time Fourier analysis to the input signal by computing the DFT of each overlapping windowed frame,

    Y(k, m) = Σ_{ν=0}^{L−1} h(ν) y(ν + mM_E) e^{−jΩ_k ν}.   (1)

Here, M_E denotes the frame shift, m ∈ Z is the frame index, k ∈ {0, 1, . . . , L − 1} is the frequency bin index, which is related to the normalized center frequency Ω_k = 2πk/L, and h(ν) denotes the window function. Typical implementations of DFT-based noise reduction algorithms use a Hann window with a 50% overlap (M_E/L = 0.5) or a Hamming window with a 75% overlap (M_E/L = 0.25) for spectral analysis, and a rectangular window for synthesis. When no confusion is possible, we drop the frame index m and write the frequency index k as a subscript. Thus, for a given frame m we have Y_k = X_k + D_k, where X_k and D_k denote the DFT coefficients of the clean speech and the noise, respectively. X_k and Y_k are characterized by their amplitudes A_k and R_k and their phases ϕ_k and θ_k, respectively: X_k = A_k e^{jϕ_k} and Y_k = R_k e^{jθ_k}. In the gain function derivations cited below, it is assumed that the DFT coefficients of both the speech and the noise are independent Gaussian random variables.

The segmentation of the input signal into frames and the selection of an analysis window are closely linked to the frame alignment of the speech coder [12] and the admissible algorithmic delay. The analysis/synthesis system must balance the conflicting requirements of sufficient spectral resolution, little spectral leakage, smooth transitions between signal frames, low delay, and low complexity. Delay and complexity constraints limit the overlap of the signal frames. However, the frame advancement must not be so aggressive as to degrade the enhanced signal's quality. When the frame overlap is less than 50%, we obtain good results with a flat-top (Tukey) analysis window and a rectangular synthesis window.
The total algorithmic delay of the joint enhancement and coding system is minimized when the frame shift of the noise reduction preprocessor is adjusted such that l(L − M_O) = l M_E = M_C, with l ∈ N, where M_C and M_O denote the frame length of the speech coder and the length of the overlapping portions of the preprocessor frames, respectively. This situation is depicted in Figure 3.
The additional delay Δ_E due to the enhancement preprocessor is equal to M_O. For the MELP coder, with its frame length of M_C = 180, we use an FFT length of L = 256 and have M_O = 76 overlapping samples between adjacent signal frames.
Reducing the number of overlapping samples M_O, and thus the delay of the joint system, has several effects. First, with a flat-top analysis window, this decreases the sidelobe attenuation during spectral analysis, which leads to increased crosstalk between frequency bins and can complicate the speech enhancement task. Most enhancement algorithms assume that adjacent frequency bins are independent and do not exploit correlation between bins. Second, as the overlap between frames is reduced, transitions between adjacent frames of the enhanced signal become less smooth. Discontinuities arise because the analysis window attenuates the input signal most at the ends of a frame, while estimation errors, which occur during the processing of the frame in the spectral domain, tend to spread evenly over the whole frame. This leads to larger relative estimation errors at the frame ends. The resulting discontinuities, which are most notable in low SNR conditions, may lead to pitch estimation errors and other speech coder artifacts.
These discontinuities are greatly reduced if we use a tapered window for spectral synthesis as well as one for spectral analysis [12]. We found that a tapered synthesis window is beneficial when the overlap M O is less than 40% of the DFT length L. In this case, the square root of the Tukey window can be used as an analysis and synthesis window. It results in a perfect reconstruction system if the signal is not modified between analysis and synthesis. Note that the use of a tapered synthesis window is also in line with the results of Griffin and Lim [13] for the MMSE reconstruction of modified short time spectra.
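The tapered analysis/synthesis scheme can be sketched as follows (a minimal numpy sketch; the square-root Tukey construction and the test signal are our own, chosen only to verify perfect reconstruction when the cosine tapers span exactly the M_O overlapping samples and the frame shift is M_E = L − M_O):

```python
import numpy as np

def sqrt_tukey(L, M_O):
    """Square root of a flat-top (Tukey) window whose cosine tapers span
    exactly the M_O overlapping samples. The product of analysis and
    synthesis windows then overlap-adds to one at a shift of L - M_O."""
    w = np.ones(L)
    i = np.arange(M_O)
    taper = np.sin(0.5 * np.pi * (i + 0.5) / M_O) ** 2
    w[:M_O] = taper          # rising taper
    w[L - M_O:] = taper[::-1]  # falling taper (complementary: sums to 1)
    return np.sqrt(w)

L, M_O = 256, 76          # FFT length and overlap used with MELP (M_C = 180)
M_E = L - M_O             # frame shift: one MELP frame per preprocessor frame
win = sqrt_tukey(L, M_O)

rng = np.random.default_rng(0)
x = rng.standard_normal(M_E * 20)

# analysis -> (no spectral modification) -> synthesis by overlap-add
y = np.zeros_like(x)
for m in range((len(x) - L) // M_E + 1):
    frame = x[m * M_E : m * M_E + L] * win          # tapered analysis
    Y = np.fft.rfft(frame)                          # spectral analysis
    y[m * M_E : m * M_E + L] += np.fft.irfft(Y, L) * win  # tapered synthesis

# perfect reconstruction away from the start-up/tail transients
err = np.max(np.abs(y[L:-L] - x[L:-L]))
print(err)
```

When the spectrum is left unmodified, the reconstruction error in the steady-state region is at machine-precision level, confirming the perfect reconstruction property claimed for the square-root Tukey pair.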

ESTIMATION OF SPEECH SPECTRAL COEFFICIENTS
Let C_k be some function of the short-time spectral amplitude A_k of the clean speech in the kth bin (e.g., A_k, log A_k, A_k²). Taking the uncertainty of speech presence into account, the MMSE estimator Ĉ_k of C_k is given by [4]

    Ĉ_k = E{C_k | Y_k, H_1^k} P(H_1^k | Y_k) + E{C_k | Y_k, H_0^k} P(H_0^k | Y_k),   (5)

where H_0^k and H_1^k represent the following hypotheses: (i) H_0^k: speech absent in kth DFT bin, (ii) H_1^k: speech present in kth DFT bin, and E{·|·} and P(·|·) denote conditional expectations and conditional probabilities, respectively. Since the amplitude vanishes under H_0^k, we have E{C_k | Y_k, H_0^k} = 0, and (5) reduces to

    Ĉ_k = E{C_k | Y_k, H_1^k} P(H_1^k | Y_k).   (6)

P(H_1^k | Y_k) is thus the soft-decision modification of the optimal estimator under the signal presence hypothesis.
Applying Bayes' rule, one obtains [4,5]

    P(H_1^k | Y_k) = Λ_k / (1 + Λ_k) ≜ G_M(k),  with  Λ_k = ((1 − q_k)/q_k) · p(Y_k | H_1^k) / p(Y_k | H_0^k),   (7)

where p(·|·) represents conditional probability densities, Λ_k is a generalized likelihood ratio, and q_k denotes the a priori probability of speech absence in the kth bin. Ĉ_k is then used to find an estimate of the clean signal spectral amplitude A_k. If C_k = A_k, as for the MMSE amplitude estimator, one gets [5]

    Â_SA(k) = G_M(k) G_SA(k) R_k,   (8)

where Â_SA(k) is the MMSE estimator of A_k that takes speech presence uncertainty into account and, according to (6) and (7), G_M(k) is the soft-decision gain modification function. The derivation of G_SA(k) can be found in [5].

MMSE-LSA and MM-LSA estimators
Based on the results reported in [6], we prefer the MMSE-LSA estimator (corresponding to C_k = log A_k) over the MMSE-STSA estimator (C_k = A_k) [5] as the basic enhancement algorithm. In this case the amplitude estimator has the form

    Â(k) = exp(G_M(k) E{log A_k | Y_k, H_1^k}) = (G_LSA(k) R_k)^{G_M(k)},   (10)

where, again, G_M(k) is the gain modification function defined in (7) and satisfies, of course, 0 ≤ G_M(k) ≤ 1. Since (10) is not multiplicative and does not result in a meaningful improvement over using G_LSA(k) alone [6], we choose to use the following estimator, which is called the multiplicatively modified LSA (MM-LSA) estimator [9]:

    Â_MM-LSA(k) = G_L(k) R_k,  with  G_L(k) = G_M(k) G_LSA(k).   (11)

It should be mentioned that in [14,15] the second term in (5) is not zeroed out, as we did in arriving at (6), but is rather constrained in such a way that (10) can be replaced by

    Â_OM-LSA(k) = (G_LSA(k))^{P(H_1^k | Y_k)} (G_min)^{1 − P(H_1^k | Y_k)} R_k,

where G_min is a threshold gain value [14,15]. This way, one gets an exact multiplicative modification of R_k, by replacing the expression for G_L(k) in (11) with (G_LSA(k))^{P(H_1^k | Y_k)} (G_min)^{1 − P(H_1^k | Y_k)}. Since the computation of G_L(k) according to (11) is simpler and gives close results for a wide range of practical SNR values [15], we prefer to continue with (11).
Under the above assumptions on speech and noise, the gain function G_LSA(k) is derived in [6] to be

    G_LSA(k) = (η_k / (1 + η_k)) exp( (1/2) ∫_{v_k}^{∞} (e^{−t}/t) dt ),
    with  v_k = (η_k / (1 + η_k)) γ_k,  γ_k = R_k² / λ_d(k),  η_k = λ_x(k) / λ_d(k),   (12)

where λ_x(k) and λ_d(k) denote the PSDs of the clean speech and the noise, respectively. In [6], γ_k is called the a posteriori SNR for bin k, η_k is called the a priori SNR, and q_k is the prior probability of speech absence discussed earlier (see (7)).
With the above definitions, the expression for Λ_k in (7) is given by [5]

    Λ_k = ((1 − q_k)/q_k) · exp(v_k) / (1 + η_k).   (14)

In order to evaluate these gain functions, one must first estimate the noise power spectrum λ_d. This is often done during periods of speech absence as determined by a voice activity detector (VAD) or, as we will show below, using the minimum statistics approach [11]. The estimated noise spectrum and the squared input amplitude R_k² provide an estimate for the a posteriori SNR. In [5,6], a decision-directed approach for estimating the a priori SNR is proposed:

    η̂_k(m) = α_η Â²(k, m − 1) / λ̂_d(k, m − 1) + (1 − α_η) max(γ_k(m) − 1, 0),   (15)

where 0 ≤ α_η ≤ 1 and Â(k, m − 1) denotes the amplitude estimate of the previous frame.
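The per-frame gain computation described above can be sketched as follows. This is a minimal sketch, not the deployed implementation: the helper `lsa_gains`, its defaults, and the first-frame handling are our own choices, and the floor η_min on the a priori SNR anticipates the limiting discussed below.

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1(v) = ∫_v^∞ e^-t/t dt

def lsa_gains(R2, lam_d, prev_amp2=None, alpha_eta=0.98, q=0.3, eta_min=0.15):
    """One frame of MM-LSA gains (illustrative helper, names are ours).
    R2: squared input amplitudes R_k^2; lam_d: noise PSD estimate;
    prev_amp2: squared amplitude estimates of the previous frame."""
    gamma = R2 / lam_d                           # a posteriori SNR
    if prev_amp2 is None:                        # first frame: ML term only
        eta = np.maximum(gamma - 1.0, eta_min)
    else:                                        # decision-directed a priori SNR
        eta = alpha_eta * prev_amp2 / lam_d \
              + (1 - alpha_eta) * np.maximum(gamma - 1.0, 0.0)
        eta = np.maximum(eta, eta_min)           # lower limit (see Section 5)
    v = eta * gamma / (1.0 + eta)
    G_lsa = eta / (1.0 + eta) * np.exp(0.5 * exp1(v))  # MMSE-LSA gain
    Lam = (1.0 - q) / q * np.exp(v) / (1.0 + eta)      # generalized likelihood ratio
    G_m = Lam / (1.0 + Lam)                             # soft-decision modification
    return G_m * G_lsa                                  # MM-LSA gain

g = lsa_gains(np.array([100.0, 1.0]), np.array([1.0, 1.0]))
print(g)   # near-unity gain for the high-SNR bin, strong attenuation otherwise
```

As expected, a bin whose a posteriori SNR is far above one is passed almost unattenuated, while a bin at the noise level is attenuated toward the floor set by η_min and the soft-decision factor.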
An important property of both the MMSE-STSA [5] and the MMSE-LSA [6] enhancement algorithms is that they do not produce musical noise [16] that plagues many other frequency-domain algorithms. This can be attributed to the above decision-directed estimation method for the a priori SNR [16]. To improve the perceived performance of the estimator, [16] recommends imposing a lower limit η MIN on the estimated η k , analogous to the use of a "spectral floor" in [17]. This lower limit depends on the overall SNR of the noisy speech and may be adaptively adjusted as outlined in Section 5. The parameter α η in (15) provides a trade-off between noise reduction and signal distortion. Typical values for α η range between 0.90 and 0.99, where at the lower end one obtains less noise reduction but also less speech distortion.
Before we consider the estimation of the prior probabilities, we mention that, in order to reduce computational complexity, the exponential integral in (12) may be evaluated using a functional approximation instead of iterative solutions or tables. Thus, to approximate

    ei(v) = ∫_v^∞ (e^{−t}/t) dt,   (16)

we use a simple piecewise functional approximation whose final branch applies for v > 1.
Since in (12) we need exp(0.5ei(v)), we show this function (solid line) alongside its approximation (dashed line) in Figure 4. For the present purpose this approximation is more than adequate.
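For illustration, one simple piecewise approximation (our own fit for demonstration purposes, not the constants used in the deployed system) combines the truncated power series of ei(v) for v ≤ 1 with the elementary bound e^{−v}/(v + 1) for v > 1. What matters in (12) is the factor exp(0.5 ei(v)), and there the relative error of this crude two-branch scheme already stays within a few percent:

```python
import numpy as np
from scipy.special import exp1  # exact exponential integral, for comparison

EULER_GAMMA = 0.5772156649015329

def ei_approx(v):
    """Illustrative piecewise approximation of ei(v) = ∫_v^∞ e^-t/t dt
    (our own simple fit, NOT the approximation of the MELPe system):
    truncated series for v <= 1, elementary bound e^-v/(v+1) for v > 1."""
    v = np.asarray(v, dtype=float)
    series = -EULER_GAMMA - np.log(v) + v - v**2 / 4 + v**3 / 18 - v**4 / 96
    tail = np.exp(-v) / (v + 1.0)
    return np.where(v <= 1.0, series, tail)

v = np.logspace(-3, 1, 400)
exact = np.exp(0.5 * exp1(v))        # the factor needed in the LSA gain (12)
approx = np.exp(0.5 * ei_approx(v))
print(np.max(np.abs(approx / exact - 1.0)))   # worst-case relative error
```

The worst case occurs just above the branch point v = 1; a fitted piecewise expression, as used in the system, reduces this error further at the same cost.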

Estimation of prior probabilities
A key feature of our speech enhancement algorithm is the estimation of the set of prior probabilities {q_k} required in (7) and (14), where k is the frequency bin index. Our first objective is to estimate a fixed q (i.e., a frequency-independent value) for each frame that contains speech. The basic idea is to estimate the relative number of frequency bins that do not contain speech and use a short-time average of this statistic as an estimate for q. Due to this averaging, the estimated q will vary in time and will serve as a control parameter in the above gain expressions.
The absence of speech energy in the kth bin clearly corresponds to η_k = 0. However, since the analysis is done with a finite length window, we can expect some leakage of energy from other bins. In addition, the human ear is unable to detect signal presence in a bin if the SNR is below a certain level η_min. In general, η_min can vary with frequency and should be chosen in accordance with a perceptual masking model. Here we choose a constant η_min for all frequency bins and set its value to the minimum level, η_MIN, that the estimate η̂_k in (15) is allowed to attain. The values used in our work ranged between 0.1 and 0.2. It is interesting to note that the use of a lower threshold on the a priori SNR has an effect similar to constraining the gain, when speech is absent, to some G_min, which is the basis for the derivation of the gain function in [14,15].
Due to the nonlinearity of the estimator for η_k in (15), there is a "locking" phenomenon to η_MIN when the speech signal level is low. Hence, one could consider using η_MIN as a threshold value to which η̂_k is compared in order to decide whether or not speech is present in bin k. However, our attempt to use this threshold resulted in excessively high counts of noise-only bins, leading to high values of q (i.e., closer to one). This is easily noticed in the enhanced signal, which suffers from an overly aggressive attenuation by the gain modification function G_M(k).
We therefore turn our attention to the a posteriori SNR, γ_k, defined in (12) and determined directly from the squared amplitude R_k², once an estimate for the noise spectrum λ_d(k) is given. Assuming that the DFT coefficients of the speech and noise are independent Gaussian random variables, the pdf of γ_k for a given value of the a priori SNR, η_k, is given by [5]

    p(γ_k | η_k) = (1 / (1 + η_k)) exp(−γ_k / (1 + η_k)),  γ_k ≥ 0.   (18)

To decide whether speech is present in the kth bin (in the sense that the true η_k has a value larger than or equal to η_min), we consider the following composite hypotheses: (H_0) η_k ≥ η_min (speech present in kth bin), (H_A) η_k < η_min (speech absent in kth bin).
We have chosen the null hypothesis (H_0) as stated above since rejecting it when true is graver than the alternative error of accepting it when false. The first type of error corresponds to deciding that speech is absent in the bin when it is actually present. Making this error would increase the estimated value of q, which has a worse effect on the enhanced speech than an underestimated value of q.
Since η_k parameterizes the pdf of γ_k, as shown in (18), γ_k can be used as a test statistic. In particular, since the likelihood ratios that correspond to the simple alternatives

    (H_0′) η_k = η_min  versus  (H_A′) η_k = η_a,   (19)

for any η_a < η_min, are monotonic functions of γ_k (for γ_k > 0 and any chosen η_min > 0), it can be shown [18] that the likelihood ratio test for the decision between these two simple hypotheses is a uniformly most powerful test for our original problem. This gives the test

    reject (H_0) if γ_k < γ_TH,   (20)

where γ_TH is set to satisfy a desired significance level [19] (or size [18]) α_0 of the test. That is, α_0 is the probability of rejecting (H_0) when true, and is therefore

    α_0 = P(γ_k < γ_TH | η_k = η_min).   (21)

Substituting the pdf of γ_k from (18), we obtain

    α_0 = 1 − exp(−γ_TH / (1 + η_min)),  or equivalently  γ_TH = −(1 + η_min) ln(1 − α_0).   (22)

Let M be the number of positive frequency bins to consider. Typically, M = (L/2) + 1, where L is the DFT transform size. However, if the input speech is limited to a narrower band, M should be chosen accordingly. Let N_q(m) be the number of bins out of the M examined bins in frame m for which the test in (20) results in the rejection of hypothesis (H_0). With r_q(m) ≜ N_q(m)/M, the proposed estimate for q(m) is formed by recursively smoothing r_q(m) in time:

    q̂(m) = α_q q̂(m − 1) + (1 − α_q) r_q(m).   (23)

The smoothing in (23) is performed only for frames which contain speech (as determined from a VAD). We selected the parameters based on informal listening tests. We noticed improved performance with α_0 = 0.5 (giving γ_TH = 0.8 in (22)) and α_q = 0.95 in (23).

Yet, as discussed earlier, a better gain modification could be expected if we allow different q's in different bins. Let I(k, m) be an index function that denotes the result of the test in (20) in the kth bin of frame m. That is, I(k, m) = 1 if (H_0) is rejected, and I(k, m) = 0 if it is accepted. We suggest the following estimator for q(k, m):

    q̂(k, m) = α_q q̂(k, m − 1) + (1 − α_q) I(k, m).   (24)

The same settings for γ_TH and α_q above are appropriate here also. This way, averaging q̂(k, m) over k in frame m results in the q̂(m) of (23).
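The threshold computation and the frame-level recursion for q can be sketched as follows (the helper `update_q` is our own illustration; the parameter values are the ones quoted above):

```python
import numpy as np

# Threshold from the significance level: alpha_0 = P(gamma_k < gamma_TH | eta = eta_min).
# With the exponential pdf of gamma_k this inverts to
#   gamma_TH = -(1 + eta_min) * ln(1 - alpha_0).
eta_min, alpha_0, alpha_q = 0.15, 0.5, 0.95
gamma_TH = -(1 + eta_min) * np.log(1 - alpha_0)
print(round(gamma_TH, 2))   # close to the value 0.8 quoted in the text

def update_q(q_prev, gamma, M=129):
    """One frame of the frequency-independent estimate: count the bins whose
    a posteriori SNR falls below gamma_TH (hypothesis of speech presence
    rejected) and smooth the relative count recursively."""
    N_q = np.count_nonzero(gamma[:M] < gamma_TH)
    r_q = N_q / M
    return alpha_q * q_prev + (1 - alpha_q) * r_q
```

With α_0 = 0.5 and η_min = 0.15 the closed form indeed gives γ_TH ≈ 0.8, matching the setting reported from the listening tests.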

VOICE ACTIVITY DETECTION AND LONG-TERM SNR ESTIMATION
The noise power estimation algorithm described in Section 6 does not rely on a VAD and therefore need not deal with detection errors. Nevertheless, it is beneficial to have a VAD available for controlling certain aspects of the preprocessor. In our algorithm we use VAD decisions to control estimates of the a priori probability of speech absence and of the long-term SNR. We briefly describe our delayed decision VAD and the long-term SNR estimation.

As in [7] (see also [20]), we have found that the mean value γ̄ of γ_k (averaged over all frequency bins in a given frame) is useful for indicating voice activity in each frame. For stationary noise and independent DFT coefficients, γ̄ is approximately normal with mean 1 and standard deviation σ_γ̄ = 1/√M (for sufficiently large M, which is usually the case). Thus, by comparing γ̄ to a suitable fixed threshold, one can obtain a reliable VAD, as long as the short-time noise spectrum does not change too fast. Typically, we use threshold values γ̄_th in the range between 1.35 and 2, where the lower value, which we denote by γ̄_th^min, corresponds to 1 + 4σ_γ̄ for M = L/2 + 1 with a transform size of L = 256 (32-millisecond window). We found this value suitable for stationary noise at input SNR values down to 3 dB. The higher threshold value allows for larger fluctuations of γ̄ (as expected if the noise is nonstationary) without causing a decision error in noise-only frames, but may result in misclassification of weak speech signals as noise, particularly at SNR values below 10 dB.

We may further improve the VAD decision by considering the maximum of γ_k, k = 0, . . . , M − 1, and the average frame SNR. We declare a speech pause if γ̄ < γ̄_th, max_k(γ_k) < γ_max-th, and mean_k(η̂(k, m)) < 2γ̄_th, where γ_max-th ≈ 25γ̄_th. Finally, we require a consistent VAD decision for at least two consecutive frames before taking action.
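The per-frame decision logic can be sketched as follows (the helper `is_speech_pause` and its argument layout are our own; it implements the single-frame test only, leaving the two-frame consistency requirement to the caller):

```python
import numpy as np

L = 256
M = L // 2 + 1                 # number of positive frequency bins
sigma = 1.0 / np.sqrt(M)       # std of the frame-averaged a posteriori SNR
gamma_th_min = 1.0 + 4.0 * sigma
print(round(gamma_th_min, 2))  # close to the lower threshold 1.35 quoted above

def is_speech_pause(gamma, gamma_th, eta):
    """Single-frame pause test (sketch): mean and max of the a posteriori
    SNR and the mean a priori SNR must all stay below their thresholds."""
    return (gamma.mean() < gamma_th
            and gamma.max() < 25.0 * gamma_th
            and eta.mean() < 2.0 * gamma_th)
```

For L = 256 the 1 + 4σ_γ̄ rule reproduces the quoted lower threshold of about 1.35; a frame of pure stationary noise (γ_k ≈ 1) is then reliably declared a pause.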
The long-term signal-to-noise ratio SNR_LT(m) characterizes the SNR of the noisy input speech averaged over periods of one to two seconds. It is used for the adaptive limiting of the a priori SNR and the adaptive smoothing of the signal power, as outlined below. The computation of SNR_LT(m) requires a VAD since the average speech power can be updated only if speech is present. The signal power P̄_y(m) is computed using a first-order recursive system update on the average frame power with time constant T_LT:

    P̄_y(m) = α_LT P̄_y(m − 1) + (1 − α_LT) (1/M) Σ_k R_k²(m),   (25)

where α_LT ≈ 1 − M_E/(T_LT f_s). With P̄_d(m) denoting the correspondingly averaged noise power, SNR_LT(m) is then given by

    SNR_LT(m) = (P̄_y(m) − P̄_d(m)) / P̄_d(m).   (26)

If SNR_LT(m) is smaller than zero, it is set equal to SNR_LT(m − 1), the estimated long-term SNR of the previous frame.
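A minimal sketch of this bookkeeping, assuming an 8 kHz sampling rate and a 1.5 s time constant (both example values; all names are ours), might look like:

```python
fs, M_E, T_LT = 8000, 180, 1.5          # assumed: 8 kHz sampling, 1.5 s horizon
alpha_LT = 1.0 - M_E / (T_LT * fs)      # first-order smoothing constant, ~0.985

def update_snr_lt(P_y, frame_power, P_d, noise_power, snr_prev, vad_speech):
    """One frame of long-term SNR tracking (sketch). The speech-frame power
    P_y is updated only when the VAD flags speech; a negative SNR estimate
    is replaced by the previous frame's value, as described above."""
    if vad_speech:
        P_y = alpha_LT * P_y + (1 - alpha_LT) * frame_power
    P_d = alpha_LT * P_d + (1 - alpha_LT) * noise_power
    snr = (P_y - P_d) / P_d
    if snr < 0:
        snr = snr_prev
    return P_y, P_d, snr

# with constant powers the recursion settles at the expected ratio
P_y, P_d, snr = 1.0, 1.0, 0.0
for _ in range(2000):
    P_y, P_d, snr = update_snr_lt(P_y, 10.0, P_d, 1.0, snr, True)
print(snr)
```

With a constant speech-frame power ten times the noise power, the estimate converges to SNR_LT ≈ 9, i.e., the (P̄_y − P̄_d)/P̄_d reading of a 10 dB input.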

ADAPTIVE LIMITING OF THE A PRIORI SNR
After applying the noise reduction preprocessor described so far to the MELP coder, we found that most of the degradations in quality and intelligibility that we witnessed were due to errors in estimating the spectral parameters in the coder.
In this section, we present a modified spectral weighting rule which allows for better spectral parameter reproduction in the MELP coder, where linear predictive coefficients (LPC) are transformed into line spectral frequencies (LSF). We use an adaptive limiting procedure on the spectral gain factors applied to each DFT coefficient. We note that while spectral valleys in between formant frequencies are not important for speech perception (and thus can be filled with noise to give a better auditory impression), they are important for LPC estimation.
It was stressed in [9,16] that, in order to avoid structured "musical" residual noise and achieve good audio quality, the a priori SNR estimate η̂_k should be limited from below to a value between 0.1 and 0.2. This means that less signal attenuation is applied to bins with low SNR in the spectral valleys between formants. By limiting the attenuation, we largely avoid the annoying "musical" distortions, and the residual noise appears very natural. However, the residual noise left in the valleys distorts the overall spectral shape of speech sounds, which impacts the spectral parameter estimation. One solution to this problem is the adaptive limiting scheme we outline below.
We utilize the VAD to distinguish between speech-and-noise and noise-only signal frames. Whenever we detect pauses in speech, we set a preliminary lower limit for the a priori SNR estimate in the mth frame to η_MIN1(m) = η_minP (typically, η_minP = 0.15) in order to achieve a smooth residual noise. During speech activity, the lower limit is set to

    η_MIN1(m) = η_minP (0.0067 (0.5 + SNR_LT(m)))^{0.65}   (27)

and is limited to a maximum of 0.25. We obtained (27) by fitting a function to data from listening tests using several long-term SNR values. We then smooth this result using a first-order recursive system to obtain smooth transitions between active and pause segments. We use the resulting η_MIN as a lower limit for η̂_k. The enhanced speech sounds appear less noisy when using the adaptive limiting procedure, while at the same time the background noise during speech pauses is very smooth and natural. This method was also found to be effective in conjunction with other speech coders. A slightly different dynamic lower limit, optimized for the 3GPP AMR coder [21], is given in [22].

NOISE POWER SPECTRAL DENSITY ESTIMATION
The importance of an accurate noise PSD estimate can be easily demonstrated in a computer simulation by estimating it directly from the isolated noise source. In fact, it turns out that many of the annoying artifacts in the processed signal are due to errors in the noise PSD estimate. It is therefore of paramount importance both to estimate the noise PSD with a small error variance and to effectively track nonstationary noise. This requires a careful balance between the degree of smoothing and the noise tracking rate. A common approach is to use a VAD and to update the estimated noise PSD during speech pauses. Since the noise PSD might also fluctuate during speech activity, VAD-based methods do not work satisfactorily when the noise is nonstationary or when the SNR is low. Soft-decision update strategies which take the probability of speech presence in each frequency bin into account [9,20] allow us to also update the noise PSD during speech activity, for example, in between the formants of the speech spectrum or in between the pitch peaks during voiced speech.
The approach we present here is based on the minimum statistics method [11,23] which is very robust, even for low SNR conditions. The minimum statistics approach assumes that speech and noise are statistically independent and that the spectral characteristics of speech vary faster in time than those of the noise. During both speech pauses and speech activity, the PSD of the noisy signal frequently decays to the level of the noise. The noise floor can therefore be estimated by tracking spectral minima within a finite time window without relying on a VAD decision. The noise PSD can be updated during speech activity, just as with soft-decision methods. An important feature of the minimum statistics method is its use of an optimally smoothed power estimate which provides a balance between the error variance and effective tracking properties.

Adaptive optimal short-term smoothing
To derive an optimal smoothing procedure for the PSD of the noisy signal, we assume a pause in speech and consider a first-order smoothing recursion for the short-term power of the DFT coefficients Y(k, m) of the mth frame (1), using a time- and frequency-dependent smoothing parameter α(k, m):

    λ_y(k, m + 1) = α(k, m) λ_y(k, m) + (1 − α(k, m)) |Y(k, m + 1)|².   (29)

Since we want λ_y(k, m) to be as close as possible to the true noise PSD λ_d(k, m), our objective is to minimize the conditional mean squared error

    E{ (λ_y(k, m + 1) − λ_d(k, m))² | λ_y(k, m) }   (30)

from one frame to the next. After substituting (29) for λ_y(k, m + 1) in (30) and using E{|Y(k, m)|²} = λ_d(k, m) and E{|Y(k, m)|⁴} = 2λ_d²(k, m), the mean squared error is given by

    α²(k, m) (λ_y(k, m) − λ_d(k, m))² + (1 − α(k, m))² λ_d²(k, m),   (31)

where we also assumed the statistical independence of successive signal frames. Setting the first derivative with respect to α(k, m) to zero yields

    α_opt(k, m) = 1 / (1 + (λ_y(k, m)/λ_d(k, m) − 1)²),   (32)

and the second derivative, being nonnegative, reveals that this is indeed a minimum. The term λ_y(k, m)/λ_d(k, m) = γ(k, m) on the right-hand side of (32) is a smoothed version of the a posteriori SNR. Figure 5 plots the optimal smoothing parameter α_opt for 0 ≤ γ ≤ 10. This parameter is between zero and one, thus guaranteeing a stable and nonnegative noise power estimate λ_y(k, m).

Assuming a pause in speech in the above derivation does not pose any major problems. The optimal smoothing procedure reacts to speech activity in the same way as to highly nonstationary noise. During speech activity, the smoothing parameter is small, allowing the PSD estimate to closely follow the time-varying PSD of the noisy speech signal.
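The closed-form minimizer can be checked numerically; the MSE expression below restates the derivation above under the exponential model for |Y|² (variance λ_d²):

```python
import numpy as np

def mse(alpha, lam_y, lam_d):
    """E{(alpha*lam_y + (1-alpha)*|Y|^2 - lam_d)^2} for |Y|^2 with mean
    lam_d and variance lam_d^2: squared bias plus residual variance."""
    return (alpha * (lam_y - lam_d)) ** 2 + (1 - alpha) ** 2 * lam_d ** 2

lam_y, lam_d = 3.0, 1.0
alpha_opt = 1.0 / (1.0 + (lam_y / lam_d - 1.0) ** 2)   # closed form: 0.2 here
alphas = np.linspace(0.0, 1.0, 10001)
best = alphas[np.argmin(mse(alphas, lam_y, lam_d))]
print(alpha_opt, best)   # grid minimum agrees with the closed form
```

When the smoothed power sits well above the noise floor (here a factor of three), heavy smoothing would bias the estimate, and the optimum correctly drops to a small value.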
To compute the optimal smoothing parameter in (32), we replace the true noise PSD λ_d(k, m) with an estimate λ̂_d(k, m). However, since the estimated noise PSD may be either too small or too large, we have to take special precautions. If the computed smoothing parameter is smaller than the optimal value, the smoothed PSD estimate λ_y(k, m) will have an increased variance. This is not a problem if the noise estimator is unbiased, since the smoothed PSD will still track the true signal PSD, and the estimated noise PSD will eventually converge to the true noise PSD. However, if the computed smoothing parameter is too large, the smoothed power will not accurately track the true signal PSD, leading to noise PSD estimation errors. We therefore introduce an additional factor α_c(m) in the numerator of the smoothing parameter, which decreases whenever deviations between the average smoothed PSD estimate and the average signal power are detected. Now the smoothing parameter has the form

    α(k, m) = (α_max α_c(m)) / (1 + (λ_y(k, m)/λ̂_d(k, m) − 1)²),   (33)

where α_max is a constant smaller than, but close to, 1 which prevents the freezing of the PSD estimator. The correction factor α_c(m) is obtained by recursively smoothing

    α̃_c(m) = 1 / (1 + (Ξ(m) − 1)²),  Ξ(m) = Σ_{k=0}^{L−1} λ_y(k, m) / Σ_{k=0}^{L−1} |Y(k, m)|²,   (34)

as

    α_c(m) = c_max α_c(m − 1) + (1 − c_max) α̃_c(m).   (35)

c_max does not appear to be a sensitive parameter and was set to 0.7. Equation (35) ensures that the average smoothed power of the noisy signal cannot deviate by a large factor from the power of the current frame. The ratio of powers Ξ(m) is thus evaluated in terms of the soft weighting function in (34), which we found very suitable for this purpose [11].
To improve the performance of the noise estimator in nonstationary noise environments, we found it necessary to also apply a lower limit α_min to α(k, m). Since α_min limits the rise and decay times of λ_y(k, m), this lower limit is a function of the overall SNR of the speech sample. To avoid attenuating weak consonants at the end of a word, we require λ_y(k, m) to decay from its peak values to the noise level in about ΔT = 64 ms, that is, within ΔT f_s/M_E frames. Therefore, α_min can be computed as

    α_min = SNR^{−M_E/(ΔT f_s)},   (36)

since this value satisfies α_min^{ΔT f_s/M_E} SNR = 1, where SNR denotes the overall SNR of the speech sample.
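Under our reading of this requirement (decay by a factor of 1/SNR within ΔT f_s/M_E frames), the limit follows in closed form; the numbers below are example values, with an assumed 8 kHz sampling rate:

```python
# Design check for the SNR-dependent lower limit alpha_min: the smoothed
# power must decay from its peak (about SNR times the noise level) to the
# noise level within dT = 64 ms, i.e. within dT*fs/M_E frame shifts.
# Solving alpha_min**(dT*fs/M_E) == 1/snr gives the closed form below.
fs, M_E, dT = 8000, 180, 0.064     # assumed sampling rate, frame shift, decay time
snr = 100.0                        # example overall SNR (20 dB)
n_frames = dT * fs / M_E           # roughly 2.8 frames available for the decay
alpha_min = snr ** (-M_E / (dT * fs))
print(alpha_min)                   # about 0.2 for this example
```

At 20 dB overall SNR only a small α_min results, so the estimator remains agile; at low SNR the bound rises, enforcing more smoothing.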

The minimum tracking algorithm
If λ_min(k, m) denotes the minimum of D consecutive PSD estimates λ_y(k, l), l = m − D + 1, . . . , m, an unbiased estimator of the noise PSD λ_d(k, m) is given by

    λ̂_d(k, m) = B_min(D, Q(k, m)) λ_min(k, m),   (37)

where the bias compensation factor B_min(D, Q(k, m)), a function of the window length D and of the degrees of freedom Q(k, m) of the smoothed PSD estimate, can be approximated by the closed-form expression (38) given in [11,23].
The unbiased estimator requires knowledge of the degrees of freedom Q(k, m) of the smoothed PSD estimate λ_y(k, m) at any given time and frequency index. In our context, Q(k, m) can attain noninteger values since the PSD is obtained via recursive smoothing and consecutive signal frames might be correlated. Since the variance of the smoothed PSD estimate λ_y(k, m) is inversely proportional to Q(k, m), we compute 1/Q(k, m) as

    1/Q(k, m) = var{λ_y(k, m)} / (2 λ_d²(k, m)),

which then allows us to approximate B_min(D, Q(k, m)) via (38).
To compute the variance of the smoothed PSD estimate λ̂_y(k, m), we estimate its first and second moments, E{λ̂_y(k, m)} and E{λ̂_y²(k, m)}, by means of first-order recursive systems,

P(k, m + 1) = β(k, m) P(k, m) + (1 − β(k, m)) λ̂_y(k, m + 1),
P2(k, m + 1) = β(k, m) P2(k, m) + (1 − β(k, m)) λ̂_y²(k, m + 1).

We choose β(k, m) = α²(k, m) and limit β(k, m) to a maximum of 0.8. Finally, we estimate 1/Q(k, m) from the variance estimate P2(k, m) − P²(k, m) and limit the result to a maximum of 0.5. This limit corresponds to the minimum degrees of freedom, Q = 2, which we obtain when no smoothing is in effect (α(k, m) = 0). Furthermore, since the error variance of the minimum statistics noise estimator is larger than the error variance of an ideal moving average estimator [11], we increase the estimated inverse degrees of freedom 1/Q(k, m) by a factor a_v, typically set to a_v = 1.5.
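The moment tracking and the resulting 1/Q estimate can be sketched as follows. The normalization by 2λ_d² and the placement of the factor a_v before the 0.5 cap are our assumptions where the garbled text is ambiguous.

```python
import numpy as np

def update_inv_q(P, P2, lambda_y, alpha, lambda_d_est, a_v=1.5):
    """One update of the inverse-degrees-of-freedom estimate (a sketch).

    P, P2        : running estimates of E{lambda_y} and E{lambda_y^2}
    lambda_y     : current smoothed PSD estimate
    alpha        : current smoothing parameter (per bin)
    lambda_d_est : current noise PSD estimate, used to normalize the variance
    """
    beta = np.minimum(alpha ** 2, 0.8)             # beta = alpha^2, capped at 0.8
    P = beta * P + (1.0 - beta) * lambda_y         # first moment recursion
    P2 = beta * P2 + (1.0 - beta) * lambda_y ** 2  # second moment recursion
    var = np.maximum(P2 - P ** 2, 0.0)             # nonnegative variance estimate
    # For a chi-square-like PSD estimate, var / lambda_d^2 ~ 2/Q. The factor
    # a_v accounts for the larger error variance of the minimum statistics
    # estimator; applying it before the 0.5 cap is our assumption.
    inv_q = np.minimum(a_v * var / (2.0 * lambda_d_est ** 2), 0.5)
    return P, P2, inv_q
```

For a stationary input the variance estimate decays toward zero, so 1/Q becomes small and the bias compensation B_min approaches its asymptotic value; the 0.5 cap keeps Q ≥ 2 as required.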

Tracking nonstationary noise
The minimum statistics method searches for the bias-compensated minimum λ_min(k, m) of D consecutive PSD estimates λ̂_y(k, l), l = m − D + 1, …, m. For each frequency bin k, the D samples are selected by sliding a rectangular window over the smoothed power data λ̂_y(k, l). Furthermore, we divide the window of D samples into U subwindows of V samples each (UV = D). This allows us to update the minimum of λ̂_y(k, m) every V samples while keeping the computational complexity low. For every V samples read, we compute the minimum of the current subwindow and store it for later use. We obtain the overall minimum by considering all such subwindow minima. We also achieve better tracking of nonstationary noise when we take local minima in the vicinity of the overall minimum λ_min(k, m) into account. For our purposes, we ignore subwindow minima that are attained in the first or the last frame of a subwindow. Since the bias compensation in (37) is a function of the window length, the minima obtained from subwindows require a bias compensation as well (i.e., with D = V in (37)). A local (subwindow) minimum may then override the overall minimum λ_min(k, m) of the D consecutive power estimates when it is close to that overall minimum. This procedure uses the spectral minima of the shorter subwindows for improved tracking. To reduce the likelihood of large estimation errors when using subwindow minima, we apply a threshold noise_slope_max to the difference between the subwindow minimum and the overall minimum. This threshold depends on the normalized averaged variance Q^{-1}(m) of λ̂_y(k, m) according to the procedure shown in Algorithm 1. A large update is only possible when the normalized averaged variance Q^{-1}(m) is small and hence when speech is most likely absent.
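The subwindow decomposition can be sketched as follows. This shows only the bookkeeping, not the bias compensation or the noise_slope_max test; a real implementation would store just the U subwindow minima, and the full D-frame history is kept here only so the decomposition can be checked against the direct minimum.

```python
import numpy as np

def window_minimum(lambda_y_history, U=8, V=12):
    """Window minimum over D = U*V frames via subwindow minima (a sketch).

    lambda_y_history : array of shape (U*V, L) holding the last D smoothed
                       PSD frames for L frequency bins.
    """
    D = U * V
    assert lambda_y_history.shape[0] == D
    # Minimum of each of the U subwindows of V consecutive frames.
    sub_minima = lambda_y_history.reshape(U, V, -1).min(axis=1)
    # The overall minimum is the minimum over the stored subwindow minima;
    # it therefore needs refreshing only once every V frames.
    return sub_minima.min(axis=0), sub_minima
```

Because the minimum over the whole window equals the minimum over the subwindow minima, memory drops from D stored frames to U stored vectors while the result is unchanged.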
Thus, we update the noise PSD estimate when a local minimum is found and the difference between the subwindow minimum and the overall minimum does not exceed the threshold noise_slope_max. A pseudocode program of the complete noise estimation algorithm is shown in Algorithm 2. All computations are embedded into loops over all frequency indices k and all frame indices m. Subwindow quantities are subscripted by sub; subwc is a subwindow counter which is initialized to subwc = V at the start of the program; actmin(k, m) and actmin_sub(k, m) are the spectral minima of the current window and subwindow up to frame m, respectively.
Algorithm 2: The minimum statistics noise estimation algorithm [11].

We point out that the tracking of nonstationary noise is significantly influenced by this mechanism and may be improved (at the expense of speech signal distortion) by increasing the noise_slope_max threshold. We also note that it is important to use an adaptive smoothing parameter α(k, m) as in (33). Otherwise, for a high SNR and a fixed smoothing parameter close to 1, the estimated signal power will decay too slowly after a period of speech activity. Hence, the minimum search window might then be too short to track the noise floor without being biased by the speech.
Although the minimum statistics approach [11, 23] was originally developed for a sampling rate of f_s = 8000 Hz and a frame advance of 128 samples, it can be easily adapted to other sampling rates and frame advance schemes. The length D of the minimum search window must be set proportional to the frame rate. For a given sampling rate f_s and frame advance M_E, the duration of the time window for minimum search, D · M_E / f_s, should be approximately 1.5 seconds. For U = 8 subwindows, we therefore use V = ⌈0.1875 f_s / M_E⌉, where ⌈x⌉ denotes the smallest integer larger than or equal to x. When a constant smoothing parameter [23] is used in (29), the length D of the minimum search window must be at least 50% larger than that of the adaptive smoothing algorithm.
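The choice of V can be checked numerically with a small helper (the function name and parameter names are our own):

```python
import math

def subwindow_length(fs, frame_advance, U=8, search_seconds=1.5):
    """Subwindow length V so that D * M_E / fs covers about search_seconds.

    With the default U = 8 and a 1.5 s search window this reduces to
    V = ceil(0.1875 * fs / frame_advance), as stated in the text.
    """
    return math.ceil(search_seconds / U * fs / frame_advance)
```

For f_s = 8000 Hz and M_E = 128, this gives V = ⌈11.72⌉ = 12, hence D = 96 and an effective search window of 96 · 128 / 8000 = 1.536 s, close to the 1.5 s target.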

EXPERIMENTAL RESULTS
The evaluation of noise reduction algorithms using instrumental ("objective") measures is an ongoing research topic [24, 25]. Frequently, quality improvements are evaluated in terms of (segmental) SNR and the achieved noise attenuation. These measures, however, can be misleading, as speech signal distortions and unnatural-sounding residual noise are not properly reflected. Also, as long as the reduction of noise power is larger than the reduction of speech power, the performance with respect to these metrics may be improved simply by applying more attenuation to the noisy signal, at the expense of speech quality. The basic trade-off between noise attenuation and speech distortion is application- and possibly listener-dependent. Even listening tests do not always lead to conclusive results, as was experienced during the standardization process of a noise reduction preprocessor for the ETSI/3GPP AMR coder [26, 27]. Specifically, the outcome of these tests depends on whether an absolute category rating (ACR) or a comparison category rating (CCR) method is favored.
To capture the possible degradations of both the speech signal and the background noise, a multifaceted approach such as the well-established diagnostic acceptability measure (DAM) is useful. The DAM evaluates a large number of quality characteristics, including the nature of the residual background noise in the enhanced signal. Intelligibility tests, although rarely used, are more conclusive and reproducible. In our investigation, we evaluated intelligibility using the standard diagnostic rhyme test (DRT). For both tests, higher scores indicate better quality. More information about the DAM and the DRT may be found in [28].
While preliminary results for a floating-point implementation of the preprocessor were presented in [2], we summarize here the results for a 16-bit fixed-point implementation used in conjunction with the MELP coder. We evaluate quality and intelligibility using DAM and DRT scores, respectively, obtained via formal listening tests. To provide an additional reference, we compare the 2.4-kbps MELP coder with our enhancement preprocessor (denoted MELPe in [1]) to the toll quality 8-kbps ITU-T coder G.729a (without a preprocessor). Compared to the results reported for the floating-point implementation [2], the fixed-point implementation scores about 2 points lower on both the DAM and the DRT scales. Table 1 presents DAM scores for the MELPe and the G.729a coders without environmental noise. Clearly, the G.729a coder, operating at a much higher bit rate than the MELPe coder, delivers significantly better quality.
In the presence of vehicular noise with an average SNR of about 6 dB (Table 2), the MELPe coder scores significantly higher than the standalone MELP coder, the unprocessed signal, and the G.729a coder. Note that the G.729a achieves approximately the same DAM score as the unprocessed signal. Tables 3 and 4 show intelligibility results for the clean and noisy conditions. For the clean condition, the higher bit rate G.729a coder is clearly more transparent, but the intelligibility of the MELPe comes surprisingly close. This reinforces the frequently made observation that high intelligibility can be achieved with low bit rate coders. For the noisy environment (Table 4), we find that the unprocessed (and unencoded) signal achieves the best intelligibility. The MELPe coder, containing the noise reduction preprocessor, yields a significant intelligibility improvement over the standalone MELP coder. These intelligibility improvements are mostly due to the conservative noise estimation algorithm, which is unbiased for stationary noise but underestimates the noise floor for nonstationary noise [11]. More detailed results for different noise environments may be found in [29].

CONCLUSION
We have presented a noise reduction preprocessor based on MMSE estimation techniques and the minimum statistics noise estimation approach. The combination of these algorithms and the careful selection of parameters lead to a noise reduction preprocessor that achieves improvements in both quality and intelligibility when used with the 2.4-kbps MELP coder. Thus, in the context of low bit rate coding, single microphone enhancement algorithms can indeed result in intelligibility improvements. The loss of intelligibility in noise is not as severe for high bit rate coders as for low bit rate coders such as the MELP coder. We believe that the potential for further improving speech transmission in noisy conditions has not yet been fully exploited. Further improvements might be obtained by using optimal enhancement algorithms for the various parameters found in speech coders, such as the LPC coefficients, the pitch, and the representation of the prediction residual signal. Such an approach is proposed in [30]. Novel noise PSD and a priori SNR estimation procedures [14, 15], as well as more realistic assumptions for the probability density functions of the speech and noise spectral coefficients [31, 32], could also lead to improved performance.