Multichannel Direction-Independent Speech Enhancement Using Spectral Amplitude Estimation

This paper introduces two short-time spectral amplitude estimators for speech enhancement with multiple microphones. Based on joint Gaussian models of the speech and noise Fourier coefficients, the clean speech amplitudes are estimated with respect to the MMSE or the MAP criterion. The estimators outperform single microphone minimum mean square amplitude estimators when the speech components are highly correlated and the noise components are sufficiently uncorrelated. Whereas the first (MMSE) estimator also requires knowledge of the direction of arrival, the second (MAP) estimator performs a direction-independent noise reduction. The estimators are generalizations of the well-known single channel MMSE estimator derived by Ephraim and Malah (1984) and the MAP estimator derived by Wolfe and Godsill (2001), respectively.


INTRODUCTION
Speech communication appliances such as voice-controlled devices, hearing aids, and hands-free telephones often suffer from poor speech quality due to background noise and room reverberation. Multiple microphone techniques such as beamformers can improve speech quality and intelligibility by exploiting the spatial diversity of speech and noise sources. Among these techniques, one can differentiate between fixed and adaptive beamformers.
A fixed beamformer combines the noisy signals by a time-invariant filter-and-sum operation. The filters can be designed to achieve constructive superposition towards a desired direction (delay-and-sum beamformer) or in order to maximize the SNR improvement (superdirective beamformer) [1,2,3].
Adaptive beamformers commonly consist of a fixed beamformer towards a fixed desired direction and an adaptive null steering towards moving interfering sources [4,5].
All beamformer techniques assume the target direction of arrival (DOA) to be known a priori or assume that it can be estimated with sufficient accuracy. Usually, the performance of such a beamforming system decreases dramatically if the DOA knowledge is erroneous. To estimate the DOA at runtime, time difference of arrival (TDOA)-based locators evaluate the maximum of a weighted cross correlation [6,7]. Subspace methods can detect multiple sources by decomposing the spatial covariance matrix into a signal and a noise subspace. However, the performance of all DOA estimation algorithms suffers severely from reverberation and from directional or diffuse background noise.
Single microphone frequency domain speech enhancement algorithms are comparably robust against reverberation and multiple sources. However, they can achieve a high noise reduction only at the expense of moderate speech distortion. Usually, such an algorithm consists of two parts. The first is a noise power spectral density estimator, based on the assumption that the noise is stationary to a much higher degree than the speech; the noise power spectral density can be estimated by averaging discrete Fourier transform (DFT) periodograms in speech pauses using a voice activity detector, or by tracking minima over a sliding time window [8]. The second is an estimator for the speech component of the noisy signal with respect to an error criterion. Commonly, a Wiener filter, the minimum mean square error (MMSE) estimator of the speech DFT amplitudes [9], or its logarithmic extension [10] is applied.
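The minimum-tracking idea behind [8] can be illustrated with a small sketch for a single frequency bin. This is a bare-bones version, not the elaborated algorithm with adaptive smoothing and bias compensation used later in this paper; the function and parameter names are illustrative:

```python
from collections import deque

def track_noise_psd(periodograms, alpha=0.85, window=50):
    """Estimate a noise PSD track for one frequency bin by recursively
    smoothing the periodogram and taking the minimum over a sliding window."""
    smoothed = None
    history = deque(maxlen=window)
    estimates = []
    for p in periodograms:
        smoothed = p if smoothed is None else alpha * smoothed + (1 - alpha) * p
        history.append(smoothed)
        estimates.append(min(history))
    return estimates
```

For a stationary noise floor with a loud speech-like burst, the minimum over the window holds the estimate near the true noise power even during the burst, which is exactly why the method needs no explicit voice activity detection.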
In this paper, we propose extensions of two single channel speech spectral amplitude estimators for use in microphone array noise reduction. Multiple noisy signals offer the possibility of a higher estimation accuracy when the desired signal components are highly correlated and the noise components are uncorrelated to a certain degree. The main contribution is a joint speech estimator that exploits the benefits of multiple observations but achieves a DOA-independent speech enhancement. Figure 1 shows an overview of the multichannel noise reduction system with the proposed speech estimators. The noisy time signals $y_i(k)$, $i \in \{1, \dots, M\}$, from $M$ microphones are transformed into the frequency domain. This is done by applying a window $h(\mu)$, for example, a Hann window, to a frame of $K$ consecutive samples and by computing the DFT of the windowed data. Before the next DFT computation, the window is shifted by $Q$ samples. The resulting complex DFT values $Y_i(\lambda, k)$ are given by
$$Y_i(\lambda, k) = \sum_{\mu=0}^{K-1} h(\mu)\, y_i(\lambda Q + \mu)\, e^{-j 2\pi \mu k / K}.$$
Here, $k$ denotes the DFT bin and $\lambda$ the subsampled time index. For the sake of brevity, $k$ and $\lambda$ are omitted in the following. The noisy DFT coefficient $Y_i$ consists of complex speech $S_i = A_i e^{j\alpha_i}$ and noise $N_i$ components:
$$Y_i = S_i + N_i = A_i e^{j\alpha_i} + N_i.$$
The noise variances $\sigma^2_{N_i}$ are estimated separately for each channel and are fed into a speech estimator. If $M = 1$, the minimum mean square short-time spectral amplitude (MMSE-STSA) estimator [9], its logarithmic extension [10], or less complex maximum a posteriori (MAP) estimators [11] can be applied to calculate real spectral weights $G_1$ for each frequency. If $M > 1$, a joint estimator can exploit information from all $M$ channels using a joint statistical model of the DFT coefficients. After IFFT and overlap-add, $M$ noise-reduced signals are synthesized. Since the phases are not modified, a beamformer could be applied additionally after synthesis.
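The analysis/synthesis chain described above can be sketched as follows. This is a naive $O(K^2)$ DFT for clarity (a real implementation would use an FFT), with illustrative names; with half-overlapping Hann windows the analysis windows sum to one, so passing the spectra through unchanged reconstructs the input in the interior:

```python
import cmath, math

def stft_frames(x, K=64, Q=32):
    """Analysis: Hann-window frames of K samples with hop Q, then a DFT per frame."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * mu / K) for mu in range(K)]  # periodic Hann
    frames = []
    for start in range(0, len(x) - K + 1, Q):
        seg = [w[mu] * x[start + mu] for mu in range(K)]
        frames.append([sum(seg[mu] * cmath.exp(-2j * math.pi * mu * k / K)
                           for mu in range(K)) for k in range(K)])
    return frames

def overlap_add(frames, n, K=64, Q=32):
    """Synthesis: inverse DFT of each (possibly gain-weighted) frame, then overlap-add."""
    y = [0.0] * n
    for f, Y in enumerate(frames):
        for mu in range(K):
            s = sum(Y[k] * cmath.exp(2j * math.pi * mu * k / K) for k in range(K)) / K
            y[f * Q + mu] += s.real
    return y
```

In the noise reduction system, the spectral gains $G_i$ would be applied to each frame's bins between analysis and synthesis; here they are implicitly one.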
The remainder of the paper is organized as follows. Section 2 introduces the underlying statistical model of multichannel Fourier coefficients. In Section 3, two new multichannel spectral amplitude estimators are derived: first, a minimum mean square estimator that evaluates the expectation of the speech spectral amplitude conditioned on all noisy complex DFT coefficients; second, a MAP estimator conditioned on the joint observation of all noisy amplitudes. Finally, Section 4 discusses the performance of the proposed estimators in ideal and realistic conditions.

STATISTICAL MODELS
Motivated by the central limit theorem, the real and imaginary parts of both speech and noise DFT coefficients are usually modelled as zero-mean independent Gaussian random variables [9,12,13] with equal variance. Recently, MMSE estimators of the complex DFT spectrum $S$ have been developed with Laplacian or Gamma modelling of the real and imaginary parts of the speech DFT coefficients [14]. However, for MMSE or MAP estimation of the speech spectral amplitude, the Gaussian model facilitates the derivation of the estimators. Moreover, since the short-time phase is of minor perceptual importance, estimating the speech spectral amplitude instead of the complex spectrum is more suitable from a perceptual point of view [15].
The Gaussian model leads to Rayleigh distributed speech amplitudes $A_i$, that is,
$$p(A_i) = \frac{2 A_i}{\sigma^2_{S_i}} \exp\left(-\frac{A_i^2}{\sigma^2_{S_i}}\right).$$
Here, $\sigma^2_{S_i}$ describes the variance of the speech in channel $i$. Moreover, the pdfs of the noisy spectrum $Y_i$ and of the noisy amplitude $R_i$ conditioned on the speech amplitude and phase are Gaussian and Rician, respectively:
$$p(Y_i \mid A_i, \alpha_i) = \frac{1}{\pi \sigma^2_{N_i}} \exp\left(-\frac{\left|Y_i - A_i e^{j\alpha_i}\right|^2}{\sigma^2_{N_i}}\right),$$
$$p(R_i \mid A_i) = \frac{2 R_i}{\sigma^2_{N_i}} \exp\left(-\frac{R_i^2 + A_i^2}{\sigma^2_{N_i}}\right) I_0\left(\frac{2 R_i A_i}{\sigma^2_{N_i}}\right).$$
Here, $I_0$ denotes the modified Bessel function of the first kind and zeroth order. To extend this statistical model to multiple noisy signals, we consider the typical noise reduction scenario of Figure 2, for example, inside a room or a car. A desired signal $s$ arrives at a microphone array from angle $\theta$.
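As a quick sanity check of the model, complex coefficients with independent zero-mean Gaussian real and imaginary parts of equal variance indeed produce Rayleigh distributed magnitudes. A small Monte Carlo sketch (illustrative, not part of the paper); under the pdf above, the Rayleigh mean is $E\{A_i\} = \sqrt{\pi \sigma^2_{S_i}}/2$:

```python
import math, random

random.seed(0)
sigma_s2 = 1.0   # speech variance E{|S_i|^2}
n = 20000
# Real and imaginary parts: independent zero-mean Gaussians with variance sigma_s2/2 each
amps = [abs(complex(random.gauss(0.0, math.sqrt(sigma_s2 / 2)),
                    random.gauss(0.0, math.sqrt(sigma_s2 / 2))))
        for _ in range(n)]
mean_amp = sum(a for a in amps) / n
mean_sq = sum(a * a for a in amps) / n
rayleigh_mean = math.sqrt(math.pi * sigma_s2) / 2  # E{A_i} of the Rayleigh pdf
```

The empirical mean amplitude matches the Rayleigh prediction and the empirical power matches $\sigma^2_{S_i}$ up to Monte Carlo error.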
Multiple noise sources arrive from various angles. The resulting diffuse noise field can be characterized by its coherence function. The magnitude squared coherence (MSC) between two omnidirectional microphones $i$ and $j$ in a diffuse noise field is given by
$$\mathrm{MSC}_{ij}(f) = \left(\frac{\sin\left(2\pi f d_{ij}/c\right)}{2\pi f d_{ij}/c}\right)^{2},$$
where $d_{ij}$ denotes the microphone distance and $c$ the speed of sound. Figure 3 plots the theoretical coherence of an ideal diffuse noise field and the measured coherence of the noise field inside a crowded cafeteria with a microphone distance of $d_{ij} = 12$ cm. For frequencies above $f_0 = c/(2 d_{ij})$, the MSC becomes very low and thus the noise components of the noisy spectra can be considered uncorrelated, that is,
$$p(N_1, \dots, N_M) = \prod_{i=1}^{M} p(N_i).$$
Hence, (5) and (4) can be extended to
$$p(Y_1, \dots, Y_M \mid A_n, \alpha_n) = \prod_{i=1}^{M} p(Y_i \mid A_i, \alpha_i), \qquad p(R_1, \dots, R_M \mid A_n) = \prod_{i=1}^{M} p(R_i \mid A_i)$$
for each $n \in \{1, \dots, M\}$. We assume the time delay of the speech signals between the microphones to be small compared to the short-time stationarity of speech and thus assume the speech spectral amplitudes $A_i$ to be highly correlated. However, due to near-field effects and different microphone amplifications, we allow a deviation of the speech amplitudes by a constant channel-dependent factor $c_i$, that is,
$$A_i = c_i A_n, \qquad c_n = 1.$$
The joint pdf of all noisy amplitudes $R_i$ given the speech amplitude of channel $n$ can then be written as
$$p(R_1, \dots, R_M \mid A_n) = \prod_{i=1}^{M} \frac{2 R_i}{\sigma^2_{N_i}} \exp\left(-\frac{R_i^2 + c_i^2 A_n^2}{\sigma^2_{N_i}}\right) I_0\left(\frac{2 R_i c_i A_n}{\sigma^2_{N_i}}\right),$$
where the $c_i$'s are fixed parameters of the joint pdf. Similarly, the pdf of all noisy spectra $Y_i$ conditioned on the clean speech amplitude and phase is
$$p(Y_1, \dots, Y_M \mid A_n, \alpha_n) = \prod_{i=1}^{M} \frac{1}{\pi \sigma^2_{N_i}} \exp\left(-\frac{\left|Y_i - c_i A_n e^{j\alpha_i}\right|^2}{\sigma^2_{N_i}}\right).$$
The unknown phases $\alpha_i$ can be expressed by $\alpha_n$, the DOA, and the DFT frequency. In analogy to the single channel MMSE estimator of the speech spectral amplitudes, the resulting joint estimators will be formulated in terms of the a priori and a posteriori SNRs
$$\xi_i = \frac{\sigma^2_{S_i}}{\sigma^2_{N_i}}, \qquad \gamma_i = \frac{R_i^2}{\sigma^2_{N_i}}.$$
Whereas the a posteriori SNRs $\gamma_i$ can be computed directly, the a priori SNRs $\xi_i$ are recursively estimated in a decision-directed manner using the estimated speech amplitude $\hat{A}_i$ of the previous frame [9]:
$$\hat{\xi}_i(\lambda) = \alpha\, \frac{\hat{A}_i^2(\lambda - 1)}{\sigma^2_{N_i}} + (1 - \alpha)\, \max\left\{\gamma_i(\lambda) - 1,\, 0\right\}.$$
The smoothing factor $\alpha$ controls the trade-off between speech quality and noise reduction [16].
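The decision-directed recursion above can be sketched for a single frequency bin as follows. This is a minimal illustration with assumed names; it uses the common implementation choice $\hat{A}_i = G_i R_i$, so that $\hat{A}_i^2/\sigma^2_{N_i} = G_i^2 \gamma_i$:

```python
def decision_directed_xi(gammas, gains, alpha=0.98):
    """Decision-directed a priori SNR estimates for one frequency bin.
    gammas: a posteriori SNRs per frame; gains: spectral gains applied per
    frame (so the previous frame's A_hat^2 / sigma_N^2 equals gain^2 * gamma)."""
    xi = []
    prev = 0.0  # gain^2 * gamma of the previous frame
    for g, gain in zip(gammas, gains):
        xi_hat = alpha * prev + (1 - alpha) * max(g - 1.0, 0.0)
        xi.append(xi_hat)
        prev = gain ** 2 * g
    return xi
```

With a constant a posteriori SNR and a gain that keeps $G^2\gamma$ fixed, the recursion locks onto that value after the first frame, illustrating the heavy weight the smoothing factor puts on the previous estimate.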

MULTICHANNEL SPECTRAL AMPLITUDE ESTIMATORS
We derive Bayesian estimators of the speech spectral amplitudes $A_n$, $n \in \{1, \dots, M\}$, using information from all $M$ channels. First, a straightforward multichannel extension of the well-known MMSE-STSA estimator by Ephraim and Malah [9] is derived. Second, a practically more useful MAP estimator for DOA-independent noise reduction is introduced. All estimators output $M$ spectral amplitudes $A_n$, and thus $M$ enhanced signals are delivered by the noise reduction system.

Estimation conditioned on complex spectra
The single channel algorithm for channel $n$ derived by Ephraim and Malah calculates the expectation of the speech spectral amplitude conditioned on the observed complex Fourier coefficient $Y_n$, that is, $E\{A_n \mid Y_n\}$. In the multichannel case, we can condition the expectation of each of the speech spectral amplitudes $A_n$ on the joint observation of all $M$ noisy spectra $Y_i$. To estimate the desired spectral amplitude of channel $n$, we have to calculate
$$\hat{A}_n = E\{A_n \mid Y_1, \dots, Y_M\}.$$
This estimator can be expressed via Bayes' rule as
$$\hat{A}_n = \frac{\int_0^{\infty}\int_0^{2\pi} A_n\, p(Y_1, \dots, Y_M \mid A_n, \alpha)\, p(A_n, \alpha)\, d\alpha\, dA_n}{\int_0^{\infty}\int_0^{2\pi} p(Y_1, \dots, Y_M \mid A_n, \alpha)\, p(A_n, \alpha)\, d\alpha\, dA_n}.$$
To solve (15), we assume perfect DOA correction, that is, $\alpha_i := \alpha$ for all $i \in \{1, \dots, M\}$. With (9) and (4), the integrand depends on $\alpha$ only through a sum of sine and cosine terms. The sum of a sine and a cosine is a cosine with a different amplitude and phase:
$$a \cos x + b \sin x = \sqrt{a^2 + b^2}\, \cos\!\left(x - \arctan\frac{b}{a}\right).$$
Since we integrate from $0$ to $2\pi$, the phase shift is irrelevant. With $\int_0^{\pi} \exp\{z \cos x\}\, dx = \pi I_0(z)$, the integral over $\alpha$ yields a modified Bessel function. The remaining integrals over $A_n$ can be solved using [17, equation (6.631.1)]. After some straightforward calculations, the gain factor for channel $n$ is expressed in closed form in terms of the confluent hypergeometric series $_1F_1$ and the Gamma function $\Gamma$. The argument of $_1F_1$ contains a sum of a priori and a posteriori SNRs with respect to the noisy phases $\vartheta_i$, $i \in \{1, \dots, M\}$; the confluent hypergeometric series has to be evaluated only once since this argument is independent of $n$. Note that in the case $M = 1$, (21) is the single channel MMSE estimator derived by Ephraim and Malah. In a practical real-time implementation, the confluent hypergeometric series is stored in a table.
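For the special case $M = 1$, the closed form is the classical Ephraim-Malah gain, $G = \Gamma(1.5)\,(\sqrt{v}/\gamma)\, {}_1F_1(-0.5; 1; -v)$ with $v = \xi\gamma/(1+\xi)$. A minimal pure-Python sketch using the equivalent Bessel-function form of $_1F_1(-0.5;1;-v)$ (function names are illustrative; a table lookup would replace this in real time):

```python
import math

def bessel_i(order, x, terms=80):
    """Modified Bessel function I_order(x) via its power series."""
    total, t = 0.0, (x / 2.0) ** order / math.factorial(order)
    for k in range(terms):
        total += t
        t *= (x / 2.0) ** 2 / ((k + 1) * (k + 1 + order))
    return total

def mmse_stsa_gain(xi, gamma):
    """Ephraim-Malah MMSE-STSA gain (single channel case, M = 1).
    Uses 1F1(-0.5; 1; -v) = exp(-v/2) * ((1 + v) I0(v/2) + v I1(v/2))."""
    v = xi * gamma / (1.0 + xi)
    phi = math.exp(-v / 2.0) * ((1.0 + v) * bessel_i(0, v / 2.0)
                                + v * bessel_i(1, v / 2.0))
    return math.gamma(1.5) * math.sqrt(v) / gamma * phi
```

At high SNR the gain approaches the Wiener gain $\xi/(1+\xi)$, which serves as a simple numerical check.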

Estimation conditioned on spectral amplitudes
The assumption $\alpha_i := \alpha$, $i \in \{1, \dots, M\}$, introduces a DOA dependency, since it holds only for speech from $\theta = 0°$ or after perfect DOA correction. For a DOA-independent speech enhancement, we condition the expectation of $A_n$ on the joint observation of all noisy amplitudes $R_i$, that is,
$$\hat{A}_n = E\{A_n \mid R_1, \dots, R_M\}.$$
When the time delay of the desired signal $s$ in Figure 2 between the microphones is small compared to the short-time stationarity of speech, the noisy amplitudes $R_i$ are independent of the DOA $\theta$. Unfortunately, after inserting (10), we have to integrate over a product of Bessel functions, which leads to extremely complicated expressions even for the simple case $M = 2$.
Therefore, searching for a closed-form estimator, we investigate a MAP solution which has been characterized in [11] as a simple but effective alternative to the mean square estimator in the single channel application.
We search for the speech spectral amplitude $\hat{A}_n$ that maximizes the pdf of $A_n$ conditioned on the joint observation of all $R_i$, $i \in \{1, \dots, M\}$:
$$\hat{A}_n = \arg\max_{A_n}\, p(A_n \mid R_1, \dots, R_M).$$
Since the denominator of Bayes' rule does not depend on $A_n$, we need to maximize only the likelihood function
$$L = p(R_1, \dots, R_M \mid A_n)\, p(A_n).$$
It is, however, easier to maximize $\log L$, without affecting the result, because the natural logarithm is a monotonically increasing function.
Using (10) and (3), we get
$$\log L = \sum_{i=1}^{M}\left[\log\frac{2 R_i}{\sigma^2_{N_i}} - \frac{R_i^2 + c_i^2 A_n^2}{\sigma^2_{N_i}} + \log I_0\!\left(\frac{2 R_i c_i A_n}{\sigma^2_{N_i}}\right)\right] + \log\frac{2 A_n}{\sigma^2_{S_n}} - \frac{A_n^2}{\sigma^2_{S_n}}.$$
A closed-form solution can be found if the modified Bessel function $I_0$ is considered asymptotically with
$$I_0(x) \approx \frac{e^x}{\sqrt{2\pi x}}.$$
Figure 4 shows that the approximation is reasonable for larger arguments and becomes erroneous only for very low SNRs. Thus, the term in the likelihood function containing the Bessel function is simplified to
$$\log I_0\!\left(\frac{2 R_i c_i A_n}{\sigma^2_{N_i}}\right) \approx \frac{2 R_i c_i A_n}{\sigma^2_{N_i}} - \frac{1}{2}\log\!\left(\frac{4\pi R_i c_i A_n}{\sigma^2_{N_i}}\right).$$
Differentiation of $\log L$ and multiplication with the amplitude $A_n$ results in $A_n\, \partial(\log L)/\partial A_n = 0$:
$$-2\left(\sum_{i=1}^{M}\frac{c_i^2}{\sigma^2_{N_i}} + \frac{1}{\sigma^2_{S_n}}\right) A_n^2 + 2\left(\sum_{i=1}^{M}\frac{R_i c_i}{\sigma^2_{N_i}}\right) A_n + 1 - \frac{M}{2} = 0.$$
This quadratic expression can have two zeros; for $M > 2$, it is also possible that no real zero exists. In this case, the apex of the parabolic curve in (26) is used as an approximation, which is identical to the real part of the complex solution. The resulting gain factor of channel $n$ is given as
$$G_n = \frac{\sqrt{\xi_n}}{2\left(1 + \sum_{i=1}^{M}\xi_i\right)\sqrt{\gamma_n}}\left(\sum_{i=1}^{M}\sqrt{\xi_i \gamma_i} + \operatorname{Re}\!\left\{\sqrt{\left(\sum_{i=1}^{M}\sqrt{\xi_i \gamma_i}\right)^{2} + (2 - M)\left(1 + \sum_{i=1}^{M}\xi_i\right)}\right\}\right).$$
For the calculation of the gain factors, no exotic function needs to be evaluated any more. Also, $\operatorname{Re}\{\cdot\}$ has to be calculated only once since its argument is independent of $n$. Again, if $M = 1$, we obtain the single channel MAP estimator as given in [11].
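A minimal sketch of this DOA-independent MAP gain in terms of the per-channel SNRs (names are illustrative). For $M = 1$ it reduces to the single channel MAP amplitude gain of Wolfe and Godsill [11], which the code uses as a cross-check:

```python
import math, cmath

def map_gain(xis, gammas, n=0):
    """DOA-independent multichannel MAP amplitude gain for channel n.
    xis, gammas: a priori / a posteriori SNRs of the M channels."""
    M = len(xis)
    s = sum(math.sqrt(x * g) for x, g in zip(xis, gammas))
    u = 1.0 + sum(xis)
    # For M > 2 the discriminant may turn negative; then the real part
    # (the apex of the parabola) is used as the solution.
    root = cmath.sqrt(s * s + (2 - M) * u).real
    return math.sqrt(xis[n]) * (s + root) / (2.0 * u * math.sqrt(gammas[n]))

def wolfe_godsill_gain(xi, gamma):
    """Single channel MAP amplitude gain of Wolfe and Godsill."""
    return (xi + math.sqrt(xi * xi + (1 + xi) * xi / gamma)) / (2 * (1 + xi))
```

The `cmath.sqrt(...).real` step implements the apex fallback mentioned above without an explicit case distinction.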

EXPERIMENTAL RESULTS
In this section, we compare the performance of the joint speech spectral amplitude estimators with the well-known single channel Ephraim-Malah algorithm. Both the $M$ single channel estimators and the joint estimators output $M$ enhanced signals. In all experiments, we do not apply additional (commonly used) soft weighting techniques [9,13], in order to isolate the benefits of the joint speech estimators compared to the single channel MMSE estimator. All estimators were embedded in the DFT-based noise reduction system of Figure 1. The system operates at a sampling frequency of $f_s = 20$ kHz using half-overlapping Hann windowed frames. Both the noise power spectral density $\sigma^2_{N_i}$ and the speech variance $\sigma^2_{S_i}$ were estimated separately for each channel. For the noise estimation task, we applied an elaborated version of minimum statistics [8] with adaptive recursive smoothing of the periodograms and adaptive bias compensation, which is capable of tracking nonstationary noise even during speech activity.
To measure the performance, the noise reduction filter was computed on speech signals with added noise at different SNRs. The resulting filter was then used to process speech and noise separately [18]. Instead of only considering the segmental SNR improvement obtained by the noise reduction algorithm, this method allows separate tracking of the speech quality and of the amount of noise reduction. The trade-off between speech quality and noise reduction amount can be regulated by, for example, changing the smoothing factor of the decision-directed speech power spectral density estimation (13). The speech quality of the noise-reduced signal was measured by averaging the segmental speech SNR between the original and the processed speech over all $M$ channels. The amount of noise reduction was measured by averaging the segmental input noise power divided by the output noise power. Although the results presented here were produced by offline processing of generated or recorded signals, the system is well suited for real-time implementation.
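The two separate-processing measures can be sketched as follows (frame length and function names are illustrative assumptions; silent frames are skipped as is common practice):

```python
import math

def seg_snr_db(reference, processed, frame=160):
    """Average segmental SNR between a reference and a processed signal, in dB."""
    vals = []
    for i in range(0, len(reference) - frame + 1, frame):
        sig = sum(x * x for x in reference[i:i + frame])
        err = sum((x - y) ** 2 for x, y in
                  zip(reference[i:i + frame], processed[i:i + frame]))
        if sig > 0 and err > 0:
            vals.append(10 * math.log10(sig / err))
    return sum(vals) / len(vals)

def noise_reduction_db(noise_in, noise_out, frame=160):
    """Average segmental input-to-output noise power ratio, in dB."""
    vals = []
    for i in range(0, len(noise_in) - frame + 1, frame):
        p_in = sum(x * x for x in noise_in[i:i + frame])
        p_out = sum(x * x for x in noise_out[i:i + frame])
        if p_in > 0 and p_out > 0:
            vals.append(10 * math.log10(p_in / p_out))
    return sum(vals) / len(vals)
```

Feeding the separately filtered speech into the first measure and the separately filtered noise into the second decouples distortion from attenuation, as described above.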
The computational power needed is approximately $M$ times that of a single channel Ephraim-Malah algorithm, since for each microphone signal an FFT, an IFFT, and an identical noise estimation algorithm are needed. The calculation of the a posteriori and a priori SNRs (12) and (13) is also done independently for each channel. The joint estimators (21) and (27) hardly increase the computational load, especially because $\operatorname{Re}\{\cdot\}$ and $_1F_1(\cdot)$ need to be calculated only once per frame and frequency bin.

Performance in artificial noise
To study the performance in ideal conditions, we first apply the estimators to identical speech signals disturbed by spatially uncorrelated white noise. Figures 5 and 6 plot the noise reduction and the speech quality of the noise-reduced signal, averaged over all $M$ microphones, for different numbers of microphones. While in Figure 5 the multichannel MMSE estimators according to (21) were applied, Figure 6 shows the performance of the multichannel MAP estimators according to (27). All joint estimators provide significantly higher speech quality and noise attenuation than the single channel MMSE estimator. The performance gain increases with the number of microphones used. The MAP estimators conditioned on the noisy amplitudes deliver a higher noise reduction than the multichannel MMSE estimator conditioned on the complex spectra, at a lower speech quality. The gain in terms of noise reduction can be exchanged for a gain in terms of speech quality by choosing different parameters.

Performance in realistic noise
Instead of uncorrelated white noise, we now mix the speech signal with noise recorded by a linear microphone array inside a crowded cafeteria. The coherence function of the approximately diffuse noise field is shown in Figure 3. Figure 7 plots the performance of the estimators using $M = 4$ microphones with an interelement spacing of $d = 12$ cm. Figure 8 shows the performance for recordings with half the microphone distance, that is, $d = 6$ cm interelement spacing. The 4d-MAP estimator provides both higher speech quality and a higher noise reduction amount than the Ephraim-Malah estimator. In both cases, the multichannel MMSE estimator delivers a much higher speech quality at an equal or lower noise reduction. According to (6), the noise correlation increases with decreasing microphone distance, and thus the performance gain of the multichannel estimators decreases. However, Figures 7 and 8 illustrate that significant performance gains are obtained at reasonable microphone distances.
Clearly, if the noise is spatially coherent, no performance gain can be expected by the multichannel spectral amplitude estimators. Compared to the 1d-MMSE, the Md-MMSE and Md-MAP deliver a lower noise reduction amount at a higher speech quality when applied to speech disturbed by coherent noise.

DOA dependency
We examine the performance of the estimators when changing the DOA of the desired signal. We consider desired sources in both far and near field with respect to an array of M = 4 microphones with d = 12 cm.

Desired signal in far field
The far-field model assumes equal amplitudes and angle-dependent TDOAs:
$$S_i = A\, e^{j(\alpha - 2\pi f \tau_i)}, \qquad \tau_i = \frac{r_i \sin\theta}{c},$$
where $r_i$ denotes the position of microphone $i$ along the array (see Figure 2) and $f$ the DFT bin frequency. Figures 9 and 10 show the performance of the 4d-estimators with cafeteria noise when the speech arrives from $\theta = 0°$, $10°$, $20°$, or $60°$ (see Figure 2). The performance of the MMSE estimator conditioned on the noisy spectra decreases with increasing angle of arrival. The speech quality decreases significantly, while the noise reduction amount is only slightly affected. This is because the phase assumption $\alpha_i = \alpha$, $i \in \{1, \dots, M\}$, is no longer fulfilled.
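The angle-dependent TDOAs and the phase offsets that violate the $\alpha_i = \alpha$ assumption can be computed with a small sketch (microphone positions and names are illustrative):

```python
import math

def far_field_delays(mic_positions, theta_deg, c=343.0):
    """Per-microphone TDOAs for a far-field plane wave from angle theta."""
    theta = math.radians(theta_deg)
    return [r * math.sin(theta) / c for r in mic_positions]

def phase_offsets(delays, f):
    """Phase terms 2*pi*f*tau_i by which the channel phases alpha_i differ."""
    return [2 * math.pi * f * tau for tau in delays]
```

At broadside ($\theta = 0°$) all delays vanish and the MMSE estimator's phase assumption holds exactly; away from broadside the offsets grow with both angle and frequency.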
On the other hand, the performance of the multichannel MAP estimator conditioned on the spectral amplitudes shows almost no dependency on the DOA.

Desired signal in near field
We investigate the performance when the source of the desired signal is located in the near field with distance $\rho_i$ to microphone $i$. To simulate a near-field source, we use range-dependent amplifications and time differences:
$$S_i = a_i A\, e^{j(\alpha - 2\pi f \rho_i / c)},$$
where the amplitude factor for each channel decreases with the distance, $a_i \sim 1/\rho_i$. The source is located at different distances $x_0$ in front of the linear microphone array ($\theta = 0°$) with $M = 4$ and $d = 12$ cm, such that $\rho_i = \sqrt{x_0^2 + r_i^2}$, where $r_i$ is defined in Figure 2. Figures 11 and 12 show the performance of the 4d-MMSE and 4d-MAP estimators, respectively, when the source is located at $x_0 = 25$ cm, 50 cm, or 100 cm from the microphone array. The speech quality of the multichannel MMSE estimator decreases with decreasing distance, because the interchannel time differences grow as the source approaches the array. Again, the multichannel MAP estimator conditioned on the noisy amplitudes shows nearly no dependency on the near-field position of the desired source.

Reverberant desired signal
Finally, we examine the performance of the estimators with a reverberant desired signal. Reverberation causes the spectral phases and amplitudes to become somewhat arbitrary, reducing the correlation of the desired signal components. For the generation of the reverberant speech signal, we simulate the acoustic situation depicted in Figure 13. The impulse responses from the source to the microphone array were simulated with the image method [19], which models the reflecting walls by several image sources. The intensity of the sound from an image source at the microphone array is determined by a frequency-independent reflection coefficient β and by the distance to the array.
In our experiment, the reverberation time was set to $T = 0.2$ second, which corresponds to a reflection coefficient $\beta = 0.72$ according to Eyring's formula
$$\beta = \exp\left(-\frac{13.82}{c\left(L_x^{-1} + L_y^{-1} + L_z^{-1}\right) T}\right),$$
where $L_x$, $L_y$, $L_z$ denote the room dimensions. Figure 14 shows the performance of the estimators when the reverberant speech signal is mixed with cafeteria noise. As expected, the overall performance gain obtained by the multichannel estimators decreases. However, a significant improvement by the multichannel MAP estimator conditioned on the spectral amplitudes remains. The multichannel MMSE estimator conditioned on the complex spectra performs worse due to its sensitivity to the phase errors caused by reverberation.
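With Eyring's formula in the form commonly used for the image method, the reflection coefficient follows directly from the reverberation time. A small sketch; the room dimensions below are assumed for illustration (the paper does not state them) and are chosen so that $T = 0.2$ s yields $\beta \approx 0.72$:

```python
import math

def eyring_beta(t60, room_dims, c=343.0):
    """Wall reflection coefficient for the image method, derived from the
    reverberation time T60 via Eyring's formula."""
    lx, ly, lz = room_dims
    return math.exp(-13.82 / (c * (1.0 / lx + 1.0 / ly + 1.0 / lz) * t60))
```

The same frequency-independent β is then applied to every image source, attenuated once per wall reflection.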

CONCLUSION
We have analytically derived a multichannel MMSE estimator and a multichannel MAP estimator of the speech spectral amplitudes, which can be considered generalizations of [9,11] to the multichannel case. Both estimators provide a significant gain over the well-known Ephraim-Malah estimator when the highly correlated speech components are in phase and the noise components are sufficiently uncorrelated. The MAP estimator conditioned on the noisy spectral amplitudes performs multichannel speech enhancement independently of the position of the desired source in the near or far field and is only moderately susceptible to reverberation. The multichannel noise reduction system is well suited for real-time implementation. It outputs multiple enhanced signals, which can be combined by a beamformer for additional speech enhancement.