Low Delay Noise Reduction and Dereverberation for Hearing Aids

A new system for single-channel speech enhancement is proposed which achieves a joint suppression of late reverberant speech and background noise with a low signal delay and low computational complexity. It is based on a generalized spectral subtraction rule which depends on the variances of the late reverberant speech and background noise. The calculation of the spectral variances of the late reverberant speech requires an estimate of the reverberation time (RT), which is obtained by a maximum likelihood (ML) approach. The enhancement with this blind RT estimation achieves almost the same speech quality as the enhancement with the actual RT. In comparison to commonly used post-filters in hearing aids which only perform a noise reduction, a significantly better objective and subjective speech quality is achieved. The proposed system performs time-domain filtering with coefficients adapted in the non-uniform (Bark-scaled) frequency-domain. This makes it possible to achieve a high speech quality with low signal delay, which is important for speech enhancement in hearing aids or related applications such as hands-free communication systems.


1. Introduction
Algorithms for the enhancement of acoustically disturbed speech signals have been the subject of intensive research over the last decades, cf., [1][2][3]. The widespread use of mobile communication devices and, not least, the introduction of digital hearing aids have contributed significantly to the interest in this field. For hearing impaired people, it is especially difficult to communicate with other persons in noisy environments. Therefore, speech enhancement systems have become an integral component of modern hearing aids. However, despite significant progress, the development of speech enhancement systems for hearing aids is still a very challenging problem due to the demanding requirements regarding computational complexity, signal delay and speech quality.
A common approach is to use a beamformer with two or three closely spaced microphones followed by a post-filter, e.g., [4,5]. An adaptive beamformer is often used, implemented by first- or second-order differential microphone arrays or by a generalized sidelobe canceller (GSC), e.g., [5]. Because only small microphone arrays can be used, the achievable noise suppression is limited, especially for diffuse noise fields. Therefore, the output signal of the beamformer is further processed by a (Wiener) post-filter to achieve an improved noise suppression, e.g., [4][5][6][7]. A related approach is to use an extension of the GSC structure termed speech distortion weighted multichannel Wiener filter [8,9]. This approach makes it possible to balance the tradeoff between speech distortions and noise reduction and is more robust towards reverberation than a common GSC.
So far, such systems achieve only a very limited suppression of speech distortions due to room reverberation. Such impairments are caused by the multiple reflections and diffraction of the sound on walls and objects of a room. These multiple echoes add to the direct sound at the receiver and blur its temporal and spectral characteristics. As a consequence, reverberation and background noise reduce listening comfort and speech intelligibility, especially for hearing impaired persons [10,11]. Therefore, algorithms for a joint suppression of background noise and reverberation effects are of special interest for speech enhancement in hearing instruments. However, many proposals are less suitable for this application.
For example, dereverberation algorithms based on linear prediction such as [12] achieve mainly a reduction of early reflections and do not consider additive noise, while algorithms based on a time-averaging [13] exhibit a high signal delay. Coherence-based speech enhancement algorithms such as [14] or [15] can suppress background noise and reverberation, but they are rather ineffective if only two closely spaced microphones can be used. This problem can be alleviated to some extent by a noise classification and binaural processing [16] which, however, requires two hearing aid devices connected by a wireless data link. A single-channel algorithm for speech dereverberation and noise reduction has been proposed recently in [17]. However, this algorithm is less suitable for hearing aids due to its high computational complexity and signal delay as well as its strong speech distortions.
EURASIP Journal on Advances in Signal Processing
A more powerful approach for noise reduction and dereverberation is to use blind source separation (BSS), e.g., [18]. Such algorithms do not require a priori knowledge about the microphone positions or source locations. However, they depend on a full data link between the hearing aid devices and possess a high computational complexity. Therefore, further work remains to be done to integrate such algorithms into common hearing instruments [19].
In this contribution, a single-channel speech enhancement algorithm is proposed, which is more suitable for current hearing aid devices. It performs a suppression of background noise and late reverberant speech by means of a generalized spectral subtraction. The devised (post-)filter exhibits a low signal delay, which is important in hearing aids, e.g., to avoid comb filter effects. The calculation of the late reverberant speech energy requires (only) an estimate of the reverberation time (RT), which is accomplished by a maximum likelihood (ML) approach. Thus, no explicit speech modeling is involved in the dereverberation process as, e.g., in [20] such that an estimation of speech model parameters is not needed here.
The paper is organized as follows. In Section 2, the underlying signal model is introduced. The overall system for low delay speech enhancement is outlined in Section 3. The calculation of the spectral weights for noise reduction and dereverberation is treated in Section 4. An important issue is the determination of the spectral variances of the late reverberant speech, which in turn is based on an estimation of the RT. These issues are treated in Sections 4.2 and 4.3. The performance of the new system is analyzed in Section 5, and the main results are summarized in Section 6.

2. Signal Model
The distorted speech signal $x(k)$ is assumed to be given by a superposition of the reverberant speech signal $z(k)$ and additive noise $v(k)$, where $k$ marks the discrete time index. The received signal $x(k)$ and the original (undisturbed) speech signal $s(k)$ are related by

$$x(k) = z(k) + v(k) = \sum_{n=0}^{L_R-1} s(k-n)\, h_r(n,k) + v(k) \tag{1}$$

with $h_r(n,k)$ representing the time-varying room impulse response (RIR) of (possibly infinite) length $L_R$ between source and receiver. The reverberant speech signal can be decomposed into its early and late reverberant components

$$z(k) = z_e(k) + z_l(k). \tag{2}$$

The late reverberation causes mainly overlap-masking effects which are usually more detrimental for the speech quality than the "coloration" effects of early reflections. Here, the early reverberant speech $z_e(k)$ (and not $s(k)$) constitutes the target signal of our speech enhancement algorithm. This makes it possible to suppress the late reverberant speech $z_l(k)$ and additive noise $v(k)$ by modeling them both as uncorrelated noise processes and to apply known speech enhancement techniques such as Wiener filtering or spectral subtraction. This concept, which has been introduced by Lebart et al. [21] and further improved by Habets [22], forms the basis of our speech enhancement algorithm. It is more practical for hearing aids as it avoids the high computational complexity and/or signal delay required by algorithms which strive for an (almost) complete cancellation of background noise and reverberation as, e.g., BSS.
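The signal model can be illustrated with a short numerical sketch. It is not taken from the paper: it assumes a time-invariant RIR (the model itself allows a time-varying one), uses white noise as a stand-in for speech, a synthetic exponentially decaying RIR, and a hypothetical split point `n_early` between early and late reflections.

```python
import numpy as np

def split_reverberant_speech(s, h, n_early):
    """Convolve s with the early and the late part of the RIR h.

    n_early (hypothetical parameter) is the number of RIR taps that are
    counted as early reflections; the remaining taps form the late part.
    """
    h = np.asarray(h, dtype=float)
    h_early = h.copy()
    h_early[n_early:] = 0.0
    h_late = h - h_early
    z_e = np.convolve(s, h_early)[:len(s)]   # early reverberant speech
    z_l = np.convolve(s, h_late)[:len(s)]    # late reverberant speech
    return z_e, z_l

rng = np.random.default_rng(0)
fs = 16000
s = rng.standard_normal(fs // 2)             # stand-in for a speech signal
k = np.arange(int(0.3 * fs))
t60 = 0.5                                    # assumed RT of the synthetic RIR
h = rng.standard_normal(k.size) * np.exp(-3.0 * np.log(10.0) / t60 * k / fs)
v = 0.1 * rng.standard_normal(s.size)        # additive noise

z_e, z_l = split_reverberant_speech(s, h, n_early=int(0.05 * fs))
x = z_e + z_l + v                            # distorted signal: z(k) + v(k)
```

Because the convolution is linear, the two components add up exactly to the reverberant signal, which is what allows the late part to be treated as an additional interference term.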

3. Low Delay Filtering
A common approach for (single-channel) speech enhancement is to perform spectral weighting in the short-term frequency-domain. The DFT coefficients of the disturbed speech $X(i,\lambda)$ are multiplied with spectral weights $W_i(\lambda)$ to obtain $M$ enhanced speech coefficients

$$\hat{S}(i,\lambda) = W_i(\lambda)\, X(i,\lambda); \quad i \in \{0, 1, \ldots, M-1\} \tag{3}$$

where $i$ denotes the frequency (channel) index and $\lambda$ the subsampled time index $\lambda = \lfloor k/R \rfloor$. (The operation $\lfloor \cdot \rfloor$ returns the greatest integer value which is lower than or equal to the argument.) For block-wise processing, the downsampling rate $R \in \mathbb{N}$ corresponds to the frame shift and $\lambda$ to the frame index. An efficient and common method to realize the short-term spectral weighting of (3) is to use a polyphase network DFT analysis-synthesis filter-bank (AS FB) with subsampling, which comprises the common overlap-add method as a special case [2,23]. A drawback of this method is that subband filters of high filter degrees are needed to achieve a sufficient stopband attenuation in order to avoid aliasing distortions, which results in a high signal delay. For hearing aids, however, an overall processing delay of less than 10 milliseconds is desirable to avoid comb filter effects, cf., [24]. Such distortions are caused by the superposition of a processed, delayed signal with an unprocessed signal which bypasses the hearing aid, e.g., through the hearing aid vent. This is especially problematic for devices with an "open fitting." Therefore, the algorithmic signal delay of the AS FB should be significantly below 10 ms. One approach to achieve a reduced delay is to design the prototype lowpass filter of the DFT filter-bank by numerical optimization with the design target to reduce the aliasing distortions with constrained signal delay [25,26].
A significantly lower signal delay can be achieved by the concept of the filter-bank equalizer proposed in [27,28]. The adaptation of the coefficients is performed in the (uniform or non-uniform) short-term frequency-domain while the actual filtering is performed in the time-domain. A related approach has been presented independently in [29] for dynamic range compression in hearing aids. The concept of the filter-bank equalizer has been further improved and generalized in [30,31]. This filter(-bank) approach is considered here as it avoids aliasing distortions for the processed signal. In addition, the use of the warped filterbank equalizer causes a significantly lower computational complexity and signal delay than the use of a non-uniform (Bark-scaled) AS FB for speech enhancement as proposed, e.g., in [32][33][34].
A general representation of the proposed speech enhancement system is provided by Figure 1. The subband signals X(i, λ) are calculated either by a uniform or warped DFT analysis filter-bank with downsampling by R, which can be efficiently implemented by a polyphase network. The choice of the downsampling rate R is here not governed by restrictions for aliasing cancellation as for AS FBs since the filtering is performed in the time-domain with coefficients adapted in the frequency-domain. The influence of aliasing effects for the calculation of the spectral weights is negligible for the considered application.
The frequency warped version is obtained by replacing the delay elements of the system by allpass filters of first order

$$z^{-1} \rightarrow H_A(z) = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}; \quad |\alpha| < 1; \ \alpha \in \mathbb{R}. \tag{4}$$

This allpass transformation makes it possible to design a filter-bank whose frequency bands approximate the Bark frequency bands (which model the frequency resolution of the human auditory system) with great accuracy [35]. This can be exploited for speech enhancement to achieve a high (subjective) speech quality with a low number of frequency channels, cf., [30]. The short-term spectral coefficients of the disturbed speech $X(i,\lambda)$ are used to calculate the spectral weights for speech enhancement $W_i(\lambda)$ as well as the weights $\tilde{W}_i(\lambda)$ for speech denoising prior to the RT estimation, see Figure 1. These spectral weights are converted to the time-domain filter coefficients $w_n(\lambda)$ and $\tilde{w}_n(\lambda)$ by means of a generalized discrete Fourier transform (GDFT)

$$w_n(\lambda) = \frac{h(n)}{M} \sum_{i=0}^{M-1} W_i(\lambda)\, e^{-j \frac{2\pi}{M} i (n - n_0)}; \quad n = 0, 1, \ldots, L \tag{5}$$

and accordingly for the weights $\tilde{W}_i(\lambda)$. The sequence $h(n)$ denotes the real, finite impulse response (FIR) of the prototype lowpass filter of the analysis filter-bank. For the common case of a prototype filter with linear phase response and even filter degree $L$, (5) applies with $n_0 = L/2$. The GDFT of (5) can be efficiently calculated by the fast Fourier transform (FFT). It is also possible to approximate the (uniform or warped) time-domain filters by FIR or IIR filters of lower degree to further reduce the overall signal delay and complexity. A more comprehensive treatment can be found in [30,31].
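The conversion of spectral weights into time-domain filter coefficients can be sketched as follows for the uniform case. This is a minimal illustration, not the authors' implementation: the function name `weights_to_fir`, the sign of the exponent and the 1/M normalization are assumptions, and the sum is evaluated directly instead of by an FFT for readability.

```python
import numpy as np

def weights_to_fir(W, h, n0):
    """Map M real spectral weights to FIR coefficients w_n, n = 0..L.

    Assumed form: w_n = h(n)/M * sum_i W_i * exp(-j*2*pi*i*(n - n0)/M),
    with h the prototype lowpass impulse response of length L + 1.
    """
    W = np.asarray(W, dtype=float)
    h = np.asarray(h, dtype=float)
    M = W.size
    n = np.arange(h.size)
    i = np.arange(M)
    kernel = np.exp(-2j * np.pi * np.outer(n - n0, i) / M)
    # For symmetric weights (W_i = W_{M-i}) the imaginary part vanishes.
    return (h * (kernel @ W) / M).real

M = 32
h = np.hanning(M + 1)                 # Hann prototype of degree L = M
w = weights_to_fir(np.ones(M), h, n0=M // 2)
```

With all weights equal to one, the filter reduces to a single tap at n = n0, that is, a pure delay of L/2 samples. This is a quick plausibility check that unity gain in every channel leaves the signal unchanged apart from the delay.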

4. Spectral Weights for Noise Reduction and Dereverberation
Two essential components of Figure 1 are the calculation of the spectral weights and the RT estimation which are treated in this section.

4.1. Concept.
The weights are calculated by the spectral subtraction rule

$$W_i(\lambda) = 1 - \frac{1}{\sqrt{\gamma(i,\lambda)}}. \tag{6}$$

This method achieves a good speech quality with low computational complexity, but other, more sophisticated estimators such as the spectral amplitude estimators of Ephraim and Malah [36] or even psychoacoustic weighting rules [37] can be employed as well, cf., [22]. The spectral weights of (6) depend on an estimation of the a posteriori signal-to-interference ratio (SIR)

$$\gamma(i,\lambda) = \frac{|X(i,\lambda)|^2}{\sigma_{zl}^2(i,\lambda) + \sigma_v^2(i,\lambda)}. \tag{7}$$

The spectral variances of the late reverberant speech and noise are given by $\sigma_{zl}^2(i,\lambda)$ and $\sigma_v^2(i,\lambda)$, cf., (1) and (2). Equation (6) can be seen as a generalized spectral subtraction rule. If no reverberation is present, that is, $z(k) = s(k)$, (7) reduces to the well-known a posteriori signal-to-noise ratio (SNR) and (6) to a "common" spectral magnitude subtraction for noise reduction.
The problem of musical tones can be alleviated by expressing the a posteriori SIR by the a priori SIR

$$\xi(i,\lambda) = \frac{E\{|Z_e(i,\lambda)|^2\}}{\sigma_{zl}^2(i,\lambda) + \sigma_v^2(i,\lambda)} = E\{\gamma(i,\lambda)\} - 1 \tag{8}$$

which can be estimated by the decision-directed approach of [36]

$$\hat{\xi}(i,\lambda) = \eta\, \frac{|\hat{S}(i,\lambda-1)|^2}{\sigma_{zl}^2(i,\lambda-1) + \sigma_v^2(i,\lambda-1)} + (1-\eta)\, \max\{\gamma(i,\lambda) - 1,\, 0\} \tag{9}$$

with $0.8 < \eta < 1$. This recursive estimation of the a priori SIR causes a significant reduction of musical tones, cf., [38]. The spectral weights are finally confined by a lower threshold

$$W_i(\lambda) \leftarrow \max\{W_i(\lambda),\, \delta_w(i,\lambda)\}. \tag{10}$$

This makes it possible to balance the tradeoff between the amount of interference suppression on the one hand, and musical tones and speech distortions on the other hand. Alternatively, it is also possible to bound the spectral weights implicitly by imposing a lower threshold on the estimated a priori SIR. The thresholds and other parameters can be adapted in a similar way as for "common" noise reduction algorithms based on spectral weighting.
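The weight calculation for one frame can be sketched as below. The sketch assumes a magnitude subtraction rule W = 1 - 1/sqrt(gamma), with the a posteriori SIR gamma replaced by xi + 1, where xi is a decision-directed estimate of the a priori SIR. The function name, the argument layout and the default values of `eta` and `floor` are illustrative assumptions.

```python
import numpy as np

def spectral_weights(X_mag2, interf_var, S_prev_mag2, interf_var_prev,
                     eta=0.9, floor=0.1):
    """Spectral subtraction weights with decision-directed a priori SIR.

    X_mag2          : |X(i, lambda)|^2 of the current frame
    interf_var      : sigma_zl^2 + sigma_v^2 of the current frame
    S_prev_mag2     : |S_hat(i, lambda - 1)|^2, enhanced previous frame
    interf_var_prev : interference variance of the previous frame
    """
    eps = 1e-12                                       # avoids division by zero
    gamma = X_mag2 / np.maximum(interf_var, eps)      # a posteriori SIR
    xi = (eta * S_prev_mag2 / np.maximum(interf_var_prev, eps)
          + (1.0 - eta) * np.maximum(gamma - 1.0, 0.0))   # a priori SIR
    W = 1.0 - 1.0 / np.sqrt(1.0 + xi)                 # subtraction rule
    return np.maximum(W, floor), xi                   # lower threshold

X_mag2 = np.array([4.0, 1.0])
ones = np.ones(2)
W_inst, _ = spectral_weights(X_mag2, ones, np.zeros(2), ones, eta=0.0)
W_dd, xi_dd = spectral_weights(X_mag2, ones, np.zeros(2), ones, eta=0.9)
```

For `eta = 0` the rule falls back to the instantaneous magnitude subtraction, while values of `eta` close to one weight the previous frame strongly, which smooths the weights over time and is what reduces musical tones.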

4.2. Interference Power Estimation.
A crucial issue is the estimation of the variances of the interfering noise and late reverberant speech to determine the a priori SIR. The spectral noise variances σ 2 v (i, λ) can be estimated by common techniques such as minimum statistics [39].
An estimator for the variances $\sigma_{zl}^2(i,\lambda)$ of the late reverberant speech can be obtained by means of a simple statistical model for the RIR of (1) [21]

$$h_m(k) = n(k)\, e^{-\rho k T_s}\, \epsilon(k) \tag{11}$$

with $\epsilon(k)$ representing the unit step sequence. The parameter $T_s = 1/f_s$ denotes the sampling period and $n(k)$ is a sequence of i.i.d. random variables with zero mean and normal distribution.
The reverberation time (RT) is defined as the time span in which the energy of a steady-state sound field in a room decays 60 dB below its initial level after switching off the excitation source [40]. It is linked to the decay rate $\rho$ of (11) by the relation

$$T_{60} = \frac{3}{\rho\, \log_{10}(e)} \approx \frac{6.908}{\rho}. \tag{12}$$

Due to this dependency, the terms decay rate and reverberation time are used interchangeably in the following. The RIR model of (11) is rather coarse, but makes it possible to derive a simple relation between the spectral variances of late reverberant speech $\sigma_{zl}^2(i,\lambda)$ and reverberant speech $\sigma_z^2(i,\lambda)$ according to [21]

$$\sigma_{zl}^2(i,\lambda) = e^{-2 \nu(i,\lambda) T_l}\, \sigma_z^2(i, \lambda - N_l). \tag{13}$$

The value $\nu(i,\lambda)$ denotes the frequency and time dependent decay rate of the RIR in the subband-domain whose blind estimation is treated in Section 4.3. The integer value $N_l = \lfloor T_l f_s / R \rfloor$ marks the number of frames corresponding to the chosen time span $T_l$, where $f_s$ denotes the sampling frequency. The value for $T_l$ is typically in a range of 20 to 100 ms and is related to the time span after which the late reverberation (presumably) begins.
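The relation between RT and decay rate follows from the energy decay of the model: the envelope exp(-rho t) corresponds to an energy decay of exp(-2 rho t), and a 60 dB drop is reached when 20 rho T60 log10(e) = 60. A small sketch (function names are illustrative):

```python
import math

def t60_from_decay_rate(rho):
    """RT in seconds for a decay rate rho in 1/s: T60 = 3 / (rho * log10(e))."""
    return 3.0 / (rho * math.log10(math.e))

def decay_rate_from_t60(t60):
    """Inverse mapping: rho = 3 / (T60 * log10(e)), about 6.908 / T60."""
    return 3.0 / (t60 * math.log10(math.e))

rho = decay_rate_from_t60(0.79)     # RT of the measured RIR used in the evaluation
energy_drop_db = 10.0 * math.log10(math.exp(-2.0 * rho * 0.79))
```

The energy level after T60 seconds evaluates to -60 dB, in line with the definition of the reverberation time.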
The variances of the reverberant speech can be estimated from the spectral coefficients $\hat{Z}(i,\lambda)$ by recursive averaging

$$\sigma_z^2(i,\lambda) = \kappa\, \sigma_z^2(i,\lambda-1) + (1-\kappa)\, |\hat{Z}(i,\lambda)|^2 \tag{14}$$

with $0 < \kappa < 1$. The spectral coefficients of the reverberant speech are obtained by spectral weighting

$$\hat{Z}(i,\lambda) = X(i,\lambda)\, \tilde{W}_i(\lambda) \tag{15}$$

using, for instance, the spectral subtraction rule of (6) based on an estimation of the a posteriori SNR. It should be noted that the spectral weights $\tilde{W}_i(\lambda)$ are also needed for the denoising prior to the RT estimation (see Figure 1). A more sophisticated (and complex) estimation of the late reverberant speech energy is proposed in [22]. It takes model inaccuracies into account that occur if the source-receiver distance is lower than the critical distance, and it requires an estimation of the direct-to-reverberation ratio for this purpose.
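The two estimation steps of this subsection can be combined in a short sketch. It assumes that the late reverberant variance is the recursively averaged reverberant variance from N_l frames earlier, attenuated by exp(-2 nu T_l). The function name, the array layout (frames by channels) and the parameter values in the example are illustrative assumptions.

```python
import numpy as np

def late_reverb_variance(Z_mag2, nu, T_l, fs, R, kappa=0.9):
    """Estimate sigma_z^2 and sigma_zl^2 for all frames and channels.

    Z_mag2 : array (frames, channels) with |Z_hat(i, lambda)|^2
    nu     : decay rate in 1/s (scalar or one value per channel)
    T_l    : time span in seconds after which late reverberation begins
    """
    n_frames, n_ch = Z_mag2.shape
    N_l = int(np.floor(T_l * fs / R))          # delay in frames
    var_z = np.zeros((n_frames, n_ch))
    var_zl = np.zeros((n_frames, n_ch))
    state = np.zeros(n_ch)
    for lam in range(n_frames):
        # recursive averaging of the reverberant speech power
        state = kappa * state + (1.0 - kappa) * Z_mag2[lam]
        var_z[lam] = state
        if lam >= N_l:
            # delayed and exponentially attenuated reverberant variance
            var_zl[lam] = np.exp(-2.0 * nu * T_l) * var_z[lam - N_l]
    return var_z, var_zl

nu = 3.0 * np.log(10.0) / 0.79                 # decay rate for T60 = 0.79 s
var_z, var_zl = late_reverb_variance(np.ones((200, 4)), nu,
                                     T_l=0.05, fs=16000, R=32)
```

For a stationary input the late reverberant variance settles at exp(-2 nu T_l) times the reverberant variance, so a longer RT (smaller decay rate) leads to a larger interference estimate and thus to a stronger suppression.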

4.3. Decay Rate Estimation.
The estimation of the frequency dependent decay rates $\nu(i,\lambda)$ of (13) requires non-subsampled subband signals, which causes a high computational complexity. To avoid this, we estimate the decay rate in the time-domain at decimated time instants $\lambda' = \lfloor k/R' \rfloor$ from the (partly) denoised, reverberant speech signal $\hat{z}(k)$ as sketched by Figure 1. The prime indicates that the update rate $R'$ for this estimation is not necessarily identical to that for the spectral weights $W_i(\lambda)$ and $\tilde{W}_i(\lambda)$. In general, the update intervals for the RT estimation can be longer than for the calculation of the spectral weights as the room acoustics usually change rather slowly.
The filter coefficients $\tilde{w}_n(\lambda)$ for the "auxiliary" time-domain filter which provides $\hat{z}(k)$ are obtained by a GDFT of the spectral weights $\tilde{W}_i(\lambda)$ used in (15), see Figure 1. The frequency dependent decay rates $\nu(i,\lambda')$, needed to evaluate (13), are obtained from the time-domain estimate of the decay rate $\hat{\rho}(\lambda')$ according to

$$\nu(i,\lambda') \approx \hat{\rho}(\lambda') \quad \forall\, i. \tag{16}$$

This approximation is rather coarse, but it yields good results in practice with a low computational complexity.
A blind estimation of the decay rate (or RT) can be performed by a maximum likelihood (ML) approach first proposed in [41,42]. A generalization of this approach to estimate the RT in noisy environments has been presented in [43]. The ML estimators are also based on the statistical RIR model of (11).
For a blind determination of the RT, an ML estimation for the decay rate $\rho$ is performed at decimated time instants $\lambda'$ on a frame with $N$ samples $\hat{z}(\lambda' R' - N + 1), \hat{z}(\lambda' R' - N + 2), \ldots, \hat{z}(\lambda' R')$ according to

$$\hat{\rho}(\lambda') = \arg\max_{\rho}\, \mathcal{L}(\lambda') \tag{17}$$

with the log-likelihood function given by

$$\mathcal{L}(\lambda') = -\frac{N}{2} \left[ (N-1) \ln(a) + \ln\!\left( \frac{2\pi}{N} \sum_{i=0}^{N-1} a^{-2i}\, \hat{z}^2(\lambda' R' - N + 1 + i) \right) + 1 \right] \tag{18}$$

where $a = \exp\{-\rho T_s\}$, cf., [43]. The corresponding RT is obtained by (12). A correct RT estimate can be expected if the current frame captures a free decay period following the sharp offset of a speech sound. Otherwise, an incorrect RT is obtained, e.g., for segments with ongoing speech, speech onsets or gradually declining speech offsets. Such estimates can be expected to overestimate the RT, since the damping of sound cannot occur at a rate faster than the free decay. However, taking the minimum of the last $K_l$ ML estimates is likely to underestimate the RT, since the ML estimate also constitutes a random variable. This bias can be reduced by "order-statistics" as known from image processing [44]. In the process, the histogram of the $K_l$ most recent ML estimates is built and its first local maximum is taken as RT estimate $\hat{T}_{60}^{(\mathrm{peak})}(\lambda')$, excluding maxima at the boundaries. The effects of "outliers" can be efficiently reduced by recursive smoothing

$$\hat{T}_{60}(\lambda') = \beta\, \hat{T}_{60}(\lambda'-1) + (1-\beta)\, \hat{T}_{60}^{(\mathrm{peak})}(\lambda') \tag{19}$$

with $0.9 < \beta < 1$. A strong smoothing can be applied as the RT usually changes rather slowly over time.
The devised RT estimation relies only on the fact that speech signals occasionally contain distinctive speech offsets, but it requires neither an explicit speech offset detection [21] nor a calibration period [45]. Another important advantage of this RT estimation is that it is developed for noisy signals, as the prior denoising can only achieve a partial noise suppression. In principle, it is also conceivable to use other methods for the continuous RT estimation, such as the Schroeder method [46] or a non-linear regression approach [47]. However, the use of such estimators has led to inferior results as the obtained histograms showed a higher spread and less distinctive local maxima. This resulted in a much higher error rate in comparison to the ML approach.
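The ML step and the histogram step can be sketched as follows. The sketch assumes the log-likelihood of an exponentially damped Gaussian sequence with the noise variance eliminated, L = -(N/2) [(N-1) ln a + ln(2 pi / N * sum_i a^(-2i) z^2(i)) + 1] with a = exp(-rho / f_s), and maximizes it by a plain grid search. The function names, the grid, the frame length and the synthetic test frame are illustrative assumptions, and the noise term treated in [43] is not modeled here.

```python
import numpy as np

def ml_decay_rate(frame, fs, rho_grid):
    """Grid search for the decay rate that maximizes the log-likelihood."""
    z2 = np.asarray(frame, dtype=float) ** 2
    N = z2.size
    i = np.arange(N)
    best_rho, best_ll = None, -np.inf
    for rho in rho_grid:
        ln_a = -rho / fs                                   # a = exp(-rho * T_s)
        weighted = np.sum(np.exp(-2.0 * i * ln_a) * z2)    # sum_i a^(-2i) z^2(i)
        ll = -0.5 * N * ((N - 1) * ln_a
                         + np.log(2.0 * np.pi * weighted / N) + 1.0)
        if ll > best_ll:
            best_rho, best_ll = rho, ll
    return best_rho

def first_histogram_peak(estimates, bin_edges):
    """Center of the first local maximum of the histogram (boundaries excluded)."""
    counts, edges = np.histogram(estimates, bins=bin_edges)
    for b in range(1, len(counts) - 1):
        if counts[b] > counts[b - 1] and counts[b] >= counts[b + 1]:
            return 0.5 * (edges[b] + edges[b + 1])
    return None                                            # no interior peak

# Synthetic free decay with T60 = 0.79 s as a stand-in for a speech offset.
fs = 16000
rho_true = 3.0 * np.log(10.0) / 0.79
rng = np.random.default_rng(1)
k = np.arange(8000)
frame = rng.standard_normal(k.size) * np.exp(-rho_true * k / fs)

t60_grid = np.arange(0.2, 1.5, 0.01)
rho_hat = ml_decay_rate(frame, fs, 3.0 * np.log(10.0) / t60_grid)
t60_est = 3.0 * np.log(10.0) / rho_hat

# Histogram step on hypothetical RT estimates (values in seconds).
recent = [0.75, 0.82, 0.84, 0.85, 0.86, 1.25, 1.25, 1.25, 1.25, 1.25]
t60_peak = first_histogram_peak(recent, np.linspace(0.0, 2.0, 21))
```

In the hypothetical histogram, the larger cluster around 1.25 s (frames with ongoing speech) is ignored in favor of the first interior peak at 0.85 s, which mirrors the reasoning that frames without a free decay overestimate the RT. In a running system, the peak value would additionally be smoothed recursively over time.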

5. Evaluation
The new system has been evaluated by means of instrumental quality measures as well as informal listening tests. The distorted speech signals are generated according to (1) for a sampling frequency of $f_s = 16$ kHz. A speech signal of 6 minutes duration is convolved with the RIR shown in Figure 2. The RIR has been measured in a highly reverberant room and possesses an RT of 0.79 s. (This value for $T_{60}$ has been determined from the measured RIR by a modified Schroeder method as described in [43].) The reverberant speech signal $z(k)$ is distorted by additive babble noise from the NOISEX-92 database with varying global input SNRs for anechoic speech $s(k)$ and additive noise $v(k)$.
For the processing according to Figure 1, a warped filter-bank equalizer is used with allpass coefficient $\alpha = 0.5$, $M = 32$ frequency channels, a downsampling rate of $R = 32$ and a Hann prototype lowpass filter of degree $L = M$. This processing with non-uniform frequency resolution achieves a good subjective speech quality with low signal delay, cf., [30]. The time-invariant group delay of both warped time-domain filters is shown in Figure 3. The group delay varies only between 0.5 ms and 3.125 ms for $f_s = 16$ kHz. Such variations do not cause audible phase distortions so that a phase equalizer is not needed here. In contrast, the use of a corresponding warped AS FB not only yields a significantly higher signal delay but also requires a phase equalization, see [31].
The spectral weights are calculated by the spectral subtraction rule of (6) using the thresholding of (10) with $\delta_w(i,\lambda) \equiv 0.2$ for the weights $W_i(\lambda)$ and $\delta_w(i,\lambda) \equiv 0.1$ for the weights $\tilde{W}_i(\lambda)$. The spectral noise variances are estimated by minimum statistics [39] and the variances of the late reverberant speech by (13). For the blind estimation of the RT according to Section 4.3, a histogram size of $K_l = 400$ values and an adaptation rate of $R' = 256$ are used. A smoothing factor of $\beta = 0.995$ is employed for (19).
The quality of the enhanced speech is evaluated in the time-domain by means of the segmental signal-to-interference ratio (SSIR), cf., [48]. It expresses the difference between the anechoic speech signal of the direct path $s_d(k)$ and the processed speech $y(k) = \hat{s}(k)$ (after group delay equalization), evaluated over the frames with speech activity. The set $F_s$ contains all frame indices corresponding to frames with speech activity and $C(F_s)$ represents its total number of elements.
The speech quality is also evaluated in the frequency-domain by means of the mean log-spectral distance (LSD) between the anechoic speech of the direct path and the processed speech, where $S_d(i,l)$ and $S_y(i,l)$ denote the short-term DFT coefficients of anechoic and processed speech for frequency index $i$ and frame $l$. The lower threshold $\delta_{\mathrm{LSD}}$ confines the dynamic range of the log-spectrum and is set here to −50 dB. Half-overlapping frames with $N_f = 256$ samples are used for the evaluations.
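A segmental measure of this kind can be sketched as below. The sketch uses a common textbook definition, the mean over active frames of 10 log10 of the reference energy divided by the energy of the difference signal, which may differ in detail from the definition used for the reported results. The energy-based selection of active frames is a crude stand-in for a speech activity detection, and all names and default values are assumptions.

```python
import numpy as np

def segmental_sir_db(s_d, y, frame_len=256, hop=128, energy_floor=1e-8):
    """Mean frame-wise ratio (in dB) of reference energy to error energy."""
    s_d = np.asarray(s_d, dtype=float)
    y = np.asarray(y, dtype=float)
    values = []
    for start in range(0, min(len(s_d), len(y)) - frame_len + 1, hop):
        ref = s_d[start:start + frame_len]
        err = ref - y[start:start + frame_len]
        ref_energy = np.sum(ref ** 2)
        if ref_energy <= energy_floor:        # skip frames without activity
            continue
        err_energy = max(np.sum(err ** 2), 1e-20)
        values.append(10.0 * np.log10(ref_energy / err_energy))
    return float(np.mean(values)) if values else float("nan")

rng = np.random.default_rng(2)
reference = rng.standard_normal(4096)         # stand-in for direct-path speech
ssir = segmental_sir_db(reference, 1.1 * reference)
```

In this toy example the error is 10% of the reference amplitude in every frame, so each frame contributes 20 dB and the mean is 20 dB as well.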
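A log-spectral distance with a lower threshold can be sketched as follows. The frame-wise RMS difference of the thresholded log-spectra, averaged over all frames, is a common definition and is assumed here; the Hann window, the use of a one-sided spectrum and the function names are further assumptions and not taken from the paper.

```python
import numpy as np

def log_spectrum_db(frame, floor_db):
    """Log-magnitude spectrum in dB, confined by a lower threshold."""
    magnitude = np.abs(np.fft.rfft(frame))
    return np.maximum(20.0 * np.log10(np.maximum(magnitude, 1e-12)), floor_db)

def mean_lsd_db(s_d, y, frame_len=256, floor_db=-50.0):
    """Mean log-spectral distance over half-overlapping frames."""
    s_d = np.asarray(s_d, dtype=float)
    y = np.asarray(y, dtype=float)
    hop = frame_len // 2
    window = np.hanning(frame_len)
    distances = []
    for start in range(0, min(len(s_d), len(y)) - frame_len + 1, hop):
        ref = log_spectrum_db(window * s_d[start:start + frame_len], floor_db)
        est = log_spectrum_db(window * y[start:start + frame_len], floor_db)
        distances.append(np.sqrt(np.mean((ref - est) ** 2)))
    return float(np.mean(distances))

rng = np.random.default_rng(3)
reference = rng.standard_normal(2048)
lsd_same = mean_lsd_db(reference, reference)
lsd_scaled = mean_lsd_db(reference, 10.0 * reference)
```

Identical signals give a distance of zero, and a gain of 20 dB gives a distance of about 20 dB as long as the spectra stay above the threshold. The threshold prevents spectral bins with very low energy from dominating the measure.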
A perceptually motivated spectral distance measure is given by the Bark spectral distortion (BSD) [49]. The Bark spectrum is calculated by three main steps: critical band filtering, equal loudness pre-emphasis and a phon-to-sone conversion. The BSD is obtained from the mean difference between the Bark spectra of undistorted speech $B_{sd}(i,l)$ and enhanced speech $B_y(i,l)$. A modification of this measure is given by the modified Bark spectral distortion (MBSD), which also takes into account the noise masking threshold of the human auditory system [50]. The (M)BSD has been originally proposed for the evaluation of speech codecs, but it can also be used as (additional) quality measure for speech enhancement systems, cf., [22].
The curves for the different measures are plotted in Figure 4. The joint suppression of late reverberant speech and noise yields a significantly better speech quality, in terms of a lower LSD and MBSD as well as a higher SSIR, in comparison to the noise reduction without dereverberation where $\sigma_{zl}^2(i,\lambda) = 0$ for (8) and (9). (Using the cepstral distance (CD) measure led to almost identical results as for the LSD measure.) For low SNRs, the dereverberation effect becomes less significant due to the high noise energy, cf., (8). This is a desirable effect as the impact of reverberation is (partially) masked by the noise in such cases. For high SNRs, the noise reduction alone still achieves a slight improvement as the noise power estimation does not yield zero values. The estimation errors of the blind RT estimation are small enough to avoid noteworthy impairments; the curves for speech enhancement with blind RT estimation are almost identical to those obtained by using the actual RT. (Using other RIRs and noise sequences led to the same results.)
Therefore, the new speech enhancement system achieves the same speech quality as the comparable approach of [22] which, however, assumes that a reliable estimate of the RT is given (and considers a common DFT AS FB).
The results of the instrumental measurements comply with our informal listening tests. The new speech enhancement system achieves a significant reduction of background noise and reverberation, but still preserves a natural sound impression. The speech signals enhanced with blind RT estimation and known RT have revealed no audible differences. The noise reduction alone achieves only a slightly audible reduction of reverberation.

6. Conclusions
A new speech enhancement algorithm for the joint suppression of late reverberant speech and background noise is proposed which addresses the special requirements of hearing aids. The enhancement is performed by a generalized spectral subtraction which depends on estimates for the spectral variances of background noise and late reverberant speech. The spectral variances of the late reverberant speech are calculated by a simple rule in dependence of the RT. The time-varying RT is estimated blindly (without dedicated excitation signals) from a noisy and reverberant speech signal by means of an ML estimation and order-statistics filtering.
In reverberant and noisy environments, the devised single-channel speech enhancement system achieves a significant reduction of interferences due to late reverberation and additive noise. The enhancement with the blind RT estimation achieves almost the same speech quality as the enhancement with the actual RT.
In contrast to existing algorithms for dereverberation and noise reduction, the proposed algorithm has a low signal delay, a reasonable computational complexity and it requires no (large) microphone array, which is of particular importance for speech enhancement in hearing aids. In comparison to commonly used post-filters in hearing aids which only perform noise reduction, a significantly better subjective and objective speech quality is achieved by the devised system.
Although the use for hearing instruments has been considered primarily here, the proposed algorithm is also suitable for other applications such as speech enhancement in hands-free devices, mobile phones or speech recognition systems.