- Research Article
- Open Access
Microphone Diversity Combining for In-Car Applications
© Jürgen Freudenberger et al. 2010
- Received: 1 August 2009
- Accepted: 17 March 2010
- Published: 27 April 2010
This paper proposes a frequency domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to ensure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. This work proposes a two-stage approach. In the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed, similar to maximum ratio combining. The combined signal is then used as a reference for a frequency domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-based noise reduction systems, even if one microphone is heavily corrupted by noise.
- Speech Signal
- Mean Opinion Score
- Voice Activity Detection
- Spectral Subtraction
- Noise Power Spectral Density
With in-car speech applications like hands-free car kits and speech recognition systems, speech is corrupted by engine noise and other noise sources like airflow from electric fans or car windows. For safety and comfort reasons, hands-free telephone systems should provide the same quality of speech as conventional fixed telephones. In practice however, the speech quality of a hands-free car kit heavily depends on the particular position of the microphone. Speech has to be picked up as directly as possible to reduce reverberation and to provide a sufficient signal-to-noise ratio. The important question, where to place the microphone inside the car, is, however, difficult to answer. The position is apparently a compromise for different speaker sizes, because the distance between microphone and speaker depends significantly on the position of the driver and therefore on the size of the driver. Furthermore, noise sources like airflow from electric fans or car windows have to be considered. Placing two or more microphones in different positions enables a better compromise with respect to different speaker sizes and yields more noise robustness.
Today, noise reduction in hands-free car kits and in-car speech recognition systems is usually based on single channel noise reduction or beamformer arrays [1–3]. Good noise robustness of single microphone systems requires the use of single channel noise suppression techniques, most of them derived from spectral subtraction [4]. Such noise reduction algorithms improve the signal-to-noise ratio, but they usually introduce undesired speech distortion. Microphone arrays can improve the performance compared to single microphone systems. Nevertheless, the signal quality still depends on the speaker position. Moreover, the microphones are located in close proximity. Therefore, microphone arrays are often vulnerable to airflow that might disturb all microphone signals.
Alternatively, multimicrophone setups have been proposed that combine the processed signals of two or more separate microphones. The microphones are positioned separately (e.g., 40 to 80 cm apart) in order to ensure incoherent recording of noise [5–11]. Similar multichannel signal processing systems have been suggested to reduce signal distortion due to reverberation [12, 13]. Basically, all these approaches exploit the fact that speech components in the microphone signals are strongly correlated while the noise components are only weakly correlated if the distance between the microphones is sufficiently large.
The question at hand with distributed arrays is how to combine these microphone signals, which may have rather different signal conditions. In this paper, we consider a diversity technique that combines the processed signals of several separate microphones. The basic idea of our approach is to apply maximum-ratio-combining (MRC) to speech signals, where we propose a frequency domain diversity approach for two or more microphone signals. MRC maximizes the signal-to-noise ratio in the combined signal.
A major issue for the application of maximum-ratio-combining for multimicrophone setups is the estimation of the acoustic transfer functions. In telecommunications, the signal attenuation as well as the phase shift for each transmission path are usually measured to apply MRC. With speech applications we have no means to directly measure the acoustic transfer functions. There exist several blind approaches to estimate the acoustic transfer functions (see, e.g., [14–16]), which were successfully applied to dereverberation. However, the proposed estimation methods are computationally demanding.
In this paper, we show that maximum-ratio-combining can be achieved without explicit knowledge of the acoustic transfer functions. Proper signal weighting can be achieved based on an estimate of the input signal-to-noise ratio. We propose a two-stage processing of the microphone signals. In the first stage, the microphone signals are weighted with respect to their input signal-to-noise ratio. These weights guarantee maximum-ratio-combining of the signals with respect to the signal magnitudes. To ensure cophasal addition of the weighted signals, we use the combined signal as reference signal for frequency domain LMS filters in the second stage. These filters adjust the phases of the microphone signals to guarantee coherent signal combining.
The proposed concept is similar to the single channel noise reduction system presented by Mukherjee and Gwee [17]. This system uses spectral subtraction to obtain a crude estimate of the speech signal. This estimate is then used as the reference signal of a single LMS filter. In this paper, we generalize this concept to multimicrophone systems, where our aim is not only noise reduction, but also dereverberation of the microphone signals.
The paper is organized as follows: In Section 2, we present some measurement results obtained in a car environment. These results motivate the proposed diversity approach. In Section 3, we present a signal combiner that achieves MRC weighting based on the knowledge of the input signal-to-noise ratios. Coherence-based signal combining is discussed in Section 4. In the subsequent section, we consider implementation issues. In particular, we present an estimator for the required input signal-to-noise ratios. Finally, in Section 6, we present some simulation results for different real world noise situations.
The basic idea of our spectral combining approach is to apply MRC to speech signals. To motivate this approach, we first discuss some measurement results obtained in a car environment. For these measurements, we used two cardioid microphones with positions suited for car integration. One microphone (denoted by mic. 1) was installed close to the inside mirror. The second microphone (mic. 2) was mounted at the A-pillar.
Theoretically, MRC of the two input signals would result in an output SNR equal to the sum of the input SNR values. With two inputs, MRC achieves a maximum gain of 3 dB for equal input SNR values. If the input SNR values are rather different, the sum is dominated by the maximum value. Hence, for the curves in Figure 1 the output SNR would essentially be the envelope of the two curves.
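As a quick numerical check of this statement, the following sketch adds two input SNR values on a linear scale and converts the result back to dB. The function name `mrc_output_snr_db` and the example SNR values are illustrative assumptions and are not taken from the measurements.

```python
import math

def mrc_output_snr_db(input_snrs_db):
    """Ideal MRC output SNR in dB: the sum of the linear input SNR values."""
    linear_sum = sum(10.0 ** (snr_db / 10.0) for snr_db in input_snrs_db)
    return 10.0 * math.log10(linear_sum)

# Equal input SNRs: the gain over a single microphone is about 3 dB.
equal_case = mrc_output_snr_db([10.0, 10.0])

# Very different input SNRs: the output is dominated by the better input.
unequal_case = mrc_output_snr_db([20.0, 0.0])

print(f"equal inputs:   {equal_case:.2f} dB")
print(f"unequal inputs: {unequal_case:.2f} dB")
```

For two inputs the gain over the better microphone can never exceed 10 log10(2), roughly 3 dB, which is why the ideal output SNR curve looks like the envelope of the input curves.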
In this section, we present the basic system concept. To simplify the discussion, we assume that all signals are stationary and that the acoustic system is linear and time-invariant. In the subsequent section we consider the modifications for nonstationary signals and time variant systems.
In the following, we assume that the speech signal and the channel coefficients are uncorrelated. We assume a complex Gaussian distribution of the noise terms. Moreover, we presume that the noise power spectral density is the same for all microphones. This assumption is reasonable for a diffuse sound field.
The optimal combining strategy that maximizes the signal-to-noise ratio in the combined signal is usually called maximal-ratio-combining (MRC). In this section, we briefly outline the derivation of the MRC weights for completeness. Furthermore, some of the properties of maximal ratio combining are discussed.
The filter defined in (12) was previously applied to speech dereverberation by Gannot and Moonen in [14], because it ideally equalizes the microphone signals if a sufficiently accurate estimate of the acoustic transfer functions is available. The problem at hand with maximum-ratio-combining is that it is rather difficult and computationally complex to explicitly estimate the acoustic transfer characteristic for our microphone system.
In the next section, we show that MRC combining can be achieved without explicit knowledge of the acoustic channels. The weights for the different microphones can be calculated based on an estimate of the signal-to-noise ratio for each microphone. The proposed filter achieves a signal-to-noise ratio according to (9), but does not guarantee perfect equalization.
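The idea that SNR estimates alone are sufficient for the magnitude weighting can be illustrated at a single frequency. The sketch below assumes real, positive transfer-function magnitudes, cophasal speech components, uncorrelated noise of equal power in all channels, and weights of the form sqrt(SNR_i / sum of SNRs). This weight form and all names and numbers are assumptions for illustration, since the weight equation itself is not reproduced here.

```python
import math

def snr_based_weights(input_snrs):
    """Weights proportional to the square root of each linear input SNR (assumed form)."""
    total = sum(input_snrs)
    return [math.sqrt(snr / total) for snr in input_snrs]

def combined_snr(channel_magnitudes, weights, speech_power, noise_power):
    """Output SNR of a weighted sum with cophasal speech and uncorrelated, equal-power noise."""
    signal_amplitude = sum(w * h for w, h in zip(weights, channel_magnitudes))
    noise_gain = sum(w * w for w in weights)
    return (signal_amplitude ** 2) * speech_power / (noise_gain * noise_power)

# Hypothetical transfer-function magnitudes at one frequency for two microphones.
magnitudes = [1.0, 0.3]
speech_power, noise_power = 1.0, 0.1

# The weights are computed from the input SNRs only, not from the magnitudes.
snrs = [h * h * speech_power / noise_power for h in magnitudes]
weights = snr_based_weights(snrs)

snr_out = combined_snr(magnitudes, weights, speech_power, noise_power)
snr_sum = sum(snrs)
print(f"output SNR {snr_out:.3f}, sum of input SNRs {snr_sum:.3f}")
```

Under these assumptions the output SNR equals the sum of the input SNRs. The combined speech component is still scaled by the unknown transfer magnitudes, which is consistent with the remark that the filter does not guarantee perfect equalization.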
3.2. Diversity Combining for Speech Signals
3.3. Magnitude Combining
leads to unreliable phase values for time-frequency points without speech. In particular, if for some frequency , the estimated phase is undefined. A combining using this estimate leads to additional signal distortions. Additionally, noise correlation would distort the phase estimation. A coarse estimate of the phase difference can also be obtained from the time-shift between the speech components in the microphone signals, for example, using the generalized correlation method [20]. The estimate is then . Note that a combiner using these phase values would in a certain manner be equivalent to a delay-and-sum beamformer. However, for distributed microphone arrays in reverberant environments this phase compensation leads to a poor estimate of the actual phase differences.
Magnitude combining therefore does not guarantee maximum-ratio-combining. Yet the combined signal is taken as a reference signal in the second stage, where the phase compensation is done. This coherence-based signal combining scheme is described in the following section.
However, as we have seen in Section 2, for practical microphone distances in the range of 0.4 to 0.8 m the noise signals are correlated for low frequencies. These correlations reduce the noise suppression capabilities of the algorithm and lead to musical noise.
We will show in this section that a combination of the spectral combining with the coherence-based approach by Martin and Vary reduces these issues.
4.1. Analysis of the LMS Approach
The filter according to (28) results in fact in a minimum mean squared error (MMSE) estimate of the signal based on the signal . Hence, the weighted output is a combination of the MMSE estimates of the speech components of the two input signals. This explains the good noise reduction properties of the approach by Martin and Vary.
On the other hand, the coherence of the noise depends strongly on the distance between the microphones. For in-car applications, practical distances are in the range of 0.4 to 0.8 m. Therefore, only the noise components for frequencies above 1 kHz can be considered to be uncorrelated .
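The dependence of the noise coherence on the microphone distance can be illustrated with the textbook model of an ideal diffuse noise field, where the coherence between two omnidirectional sensors is sin(x)/x with x = 2πfd/c. This model, the speed of sound, and the chosen frequencies are assumptions for illustration only. Measured in-car noise and cardioid microphones will deviate from it.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def diffuse_coherence(frequency_hz, distance_m, c=SPEED_OF_SOUND):
    """Coherence sin(x)/x of an ideal diffuse noise field between two omnidirectional sensors."""
    x = 2.0 * math.pi * frequency_hz * distance_m / c
    return 1.0 if x == 0.0 else math.sin(x) / x

for distance in (0.4, 0.8):
    for freq in (100.0, 500.0, 1000.0, 2000.0):
        msc = diffuse_coherence(freq, distance) ** 2
        print(f"d = {distance} m, f = {freq:6.0f} Hz: magnitude-squared coherence {msc:.3f}")
```

In this model the magnitude-squared coherence is bounded by 1/x², so it is already small at 1 kHz for both distances, while it remains close to one at 100 Hz for the shorter distance.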
of the filter transfer function. An approach to correct the filter bias by estimating the noise cross-power density was presented in . Another issue with speech enhancement solely based on the LMS approach is that the speech signals at the microphone inputs may only be weakly correlated for some frequencies as shown in Section 2. Consequently, these frequency components will be attenuated in the output signals.
In the following, we discuss a modified LMS approach, where we first combine the microphone signals to obtain an improved reference signal for the adaptation of the LMS filters.
4.2. Combining MRC and LMS
To ensure suitable weighting and coherent signal addition we combine the diversity technique with the LMS approach to process the signals of the different microphones. It is informative to examine the combined approach under ideal conditions, that is, we assume ideal MRC weighting.
in this sum is the Wiener filter that results in a minimum mean squared error estimate of the signal based on the signal . The Wiener filter equalizes the microphone signal and minimizes the mean squared error between the filter output and the actual speech signal . Note that the phase of the term in (36) is , that is, the filter compensates the phase of the acoustic transfer function .
The other terms in the sum can be considered as filter biases where the term in (34) depends on the noise power density of the th input. The remaining terms depend on the noise cross power and vanish for uncorrelated noise signals. However, noise correlation might distort the phase estimation.
The LMS algorithm estimates implicitly the phase differences between the reference signal and the input signals . Hence, the spectra at the outputs of the filters are in phase. This enables a cophasal addition of the signals according to (20).
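The implicit phase estimation can be illustrated with a strongly simplified stand-in for the frequency domain LMS filter: a single complex coefficient per frequency bin, adapted with a normalized LMS update. This is not the FLMS structure used in the paper. The function name, step size, and the noise-free synthetic data are assumptions chosen to keep the sketch short.

```python
import cmath
import random

def nlms_phase_alignment(inputs, reference, step_size=0.5, eps=1e-12):
    """Adapt one complex coefficient so that coefficient * input tracks the reference."""
    coefficient = 0j
    for x, d in zip(inputs, reference):
        error = d - coefficient * x
        # Normalized LMS update for a single complex tap.
        coefficient += step_size * error * x.conjugate() / (abs(x) ** 2 + eps)
    return coefficient

random.seed(0)
phase_offset = 1.2  # hypothetical phase difference between reference and input (radians)

# Synthetic spectral values of one frequency bin over 200 frames.
inputs = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
reference = [x * cmath.exp(1j * phase_offset) for x in inputs]

coefficient = nlms_phase_alignment(inputs, reference)
estimated_phase = cmath.phase(coefficient)
print(f"estimated phase difference: {estimated_phase:.4f} rad")
```

After convergence the filtered input has the same phase as the reference in this bin, so the filter outputs of all channels can be added cophasally. With noisy inputs the coefficient would additionally carry the biases discussed above.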
By estimating the noise power and noise cross-power densities we could correct the biases of the LMS filter transfer functions. Similarly, reducing the noisy signal components in (30) diminishes the filter biases. In the following, we will pursue the latter approach.
4.3. Noise Suppression
Maximum-ratio-combining provides an optimum weighting of the sensor signals. However, it does not necessarily suppress the noisy signal components. We therefore combine the spectral combining with an additional noise suppression filter. Of the numerous noise reduction techniques proposed in the literature, we consider only spectral subtraction [4], which supplements the spectral combining quite naturally. The basic idea of spectral subtraction is to subtract an estimate of the noise floor from an estimate of the spectrum of the noisy signal.
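The basic idea of spectral subtraction can be sketched as a gain function per time-frequency point. The following minimal example uses power spectral subtraction with an over-subtraction factor and a lower bound on the gain. The function name, the default parameter values, and the gain floor are assumptions for illustration and are not the filter derived in this paper.

```python
import math

def spectral_subtraction_gain(noisy_power, noise_power, oversubtraction=1.0, floor=0.01):
    """Magnitude gain from power spectral subtraction with a lower bound on the gain."""
    if noisy_power <= 0.0:
        return math.sqrt(floor)
    # Subtract the (scaled) noise power estimate and normalize by the noisy power.
    ratio = 1.0 - oversubtraction * noise_power / noisy_power
    # The floor avoids negative values when the noise estimate exceeds the input power.
    return math.sqrt(max(ratio, floor))

high_snr_gain = spectral_subtraction_gain(noisy_power=101.0, noise_power=1.0)
low_snr_gain = spectral_subtraction_gain(noisy_power=1.1, noise_power=1.0)
print(f"gain at high SNR: {high_snr_gain:.3f}, gain at low SNR: {low_snr_gain:.3f}")
```

At high SNR the gain stays close to one, whereas at low SNR the time-frequency point is strongly attenuated.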
This formula shows that noise suppression can be introduced by simply adding a constant to the numerator term in (14).
The parameter hardly affects the gain factors for high signal-to-noise ratios, so the optimum weighting is retained. For low signal-to-noise ratios this term leads to an additional attenuation. The over-subtraction factor is usually a function of the SNR; sometimes it is also chosen differently for different frequency bands.
Real world speech and noise signals are non-stationary processes. For an implementation of the spectral weighting, we have to consider short-time spectra of the microphone signals and estimate the short-time power spectral densities (PSD) of the speech signal and the noise components.
Therefore, the noisy signal is transformed into the frequency domain using a short-time Fourier transform of length . Each block of consecutive samples is multiplied with a Hamming window. Subsequent blocks are overlapping by samples. Let , , and denote the corresponding short-time spectra, where is the subsampled time index and is the frequency bin index.
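The block processing described above can be sketched as follows. The block length, the hop size, and the test signal are hypothetical values chosen for illustration, because the parameters used in the paper are not reproduced here. A direct DFT is used only to keep the example free of dependencies.

```python
import cmath
import math

def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]

def dft(block):
    """Direct DFT with quadratic cost; an FFT would be used in practice."""
    length = len(block)
    return [
        sum(block[n] * cmath.exp(-2j * math.pi * k * n / length) for n in range(length))
        for k in range(length)
    ]

def short_time_spectra(signal, block_length, hop):
    """Split the signal into overlapping blocks, window each block, and transform it."""
    window = hamming_window(block_length)
    spectra = []
    for start in range(0, len(signal) - block_length + 1, hop):
        block = [signal[start + n] * window[n] for n in range(block_length)]
        spectra.append(dft(block))
    return spectra

# Hypothetical test tone with 8 samples per period, so it falls on bin 32 / 8 = 4.
block_length, hop = 32, 16
signal = [math.sin(2.0 * math.pi * n / 8.0) for n in range(128)]

spectra = short_time_spectra(signal, block_length, hop)
peak_bin = max(range(block_length // 2), key=lambda k: abs(spectra[0][k]))
print(f"{len(spectra)} blocks, spectral peak of the first block at bin {peak_bin}")
```

Here `spectra[kappa][nu]` corresponds to the short-time spectrum at subsampled time index `kappa` and frequency bin index `nu`; both index names are placeholders.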
5.1. System Structure
To perform spectral combining we have to estimate the current signal-to-noise ratio based on the noisy microphone input signals. In the next sections, we propose a simple and efficient method to estimate the noise power spectral densities of the microphone inputs.
5.2. PSD Estimation
Commonly the noise PSD is estimated in speech pauses where the pauses are detected using voice activity detection (VAD, see e.g., [24, 26]). VAD-based methods provide good estimates for stationary noise. However, they may suffer from error propagation if subsequent decisions are not independent. Other methods, like the minimum statistics approach introduced by Martin [23, 27], use a continuous estimation that does not explicitly differentiate between speech pauses and speech active segments.
Our estimation method combines the VAD approach with the minimum statistics (MS) method. Minimum statistics is a robust technique to estimate the power spectral density of non-stationary noise by tracing the minimum of the recursively smoothed power spectral density within a time window of 1 to 2 seconds. We use these MS estimates and a simple threshold test to determine voice activity for each time-frequency point.
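The core of the minimum statistics idea, tracing the minimum of a recursively smoothed power sequence within a sliding window, can be sketched for a single frequency bin as follows. The smoothing constant, the window length in frames, and the synthetic power sequence are assumptions. The bias compensation and optimal smoothing of the full method are omitted.

```python
def minimum_statistics(power_frames, smoothing=0.9, window_frames=5):
    """Track the minimum of a recursively smoothed power sequence in a sliding window."""
    smoothed = power_frames[0]
    history = []
    estimates = []
    for power in power_frames:
        # First-order recursive smoothing of the short-time power.
        smoothed = smoothing * smoothed + (1.0 - smoothing) * power
        history.append(smoothed)
        if len(history) > window_frames:
            history.pop(0)
        # The minimum within the window serves as the noise power estimate.
        estimates.append(min(history))
    return estimates

# Hypothetical power sequence of one frequency bin: noise floor 1.0 with a speech burst.
frames = [1.0] * 10 + [20.0] * 3 + [1.0] * 10
noise_estimates = minimum_statistics(frames, smoothing=0.5, window_frames=8)
print([round(value, 2) for value in noise_estimates])
```

Because the window is longer than the speech burst, the estimate stays close to the noise floor while speech is present. In a real system, `window_frames` would be chosen so that the window covers 1 to 2 seconds at the given frame rate.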
The proposed method prevents error propagation, because the MS approach is independent of the VAD. During speech pauses the noise PSD estimation can be enhanced compared with an estimate solely based on minimum statistics. A similar time-frequency dependent VAD was presented by Cohen to enhance the noise power spectral density estimation of minimum statistics [28].
The MS approach provides a rough estimate of the noise power that strongly depends on the smoothing parameter and the window size of the sliding window (for details cf. ). However, this estimate can be obtained regardless of speech being present or not.
assuming that the noise and speech signals are uncorrelated.
5.3. Voice Activity Detection
Human speech contains gaps not only in time but also in frequency domain. It is therefore reasonable to estimate the voice activity in the time-frequency domain in order to obtain a more accurate VAD. The VAD function can then be calculated upon the current input noise PSD obtained by minimum statistics.
in speech pauses.
is according to the Neyman-Pearson-Lemma (see e.g., ) an optimal decision rule. That is, for a fixed probability of a false alarm it minimizes the probability of a miss and vice versa. The generalized likelihood-ratio test was previously used by Sohn and Sung to detect speech activity in subbands [29, 31].
This threshold test is equivalent to the decision rule in (61). With this threshold test, speech is detected if the current input power is greater than or equal to the average noise power times the threshold. This factor depends on the input signal-to-noise ratio and the a priori probability of speech absence.
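The threshold test can be sketched per time-frequency point as follows. For simplicity the threshold is a fixed constant here, which is an assumption. In the text it depends on the input signal-to-noise ratio and the a priori probability of speech absence. The function name and all power values are hypothetical.

```python
def time_frequency_vad(input_power, noise_power, threshold):
    """Return True where the input power reaches the noise power times the threshold."""
    return [
        [power >= threshold * noise for power, noise in zip(frame_power, frame_noise)]
        for frame_power, frame_noise in zip(input_power, noise_power)
    ]

# Hypothetical powers for two frames and three frequency bins.
input_power = [[1.0, 8.0, 1.2], [0.9, 1.1, 6.0]]
noise_power = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

decisions = time_frequency_vad(input_power, noise_power, threshold=3.0)
print(decisions)
```

Each entry of `decisions` marks one time-frequency point, so the noise PSD estimate can be updated in the bins without speech even while other bins of the same frame contain speech.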
In this section, we present some simulation results for different noise conditions typical in a car. For our simulations we consider the same microphone setup as described in Section 2, that is, we use a two-channel diversity system, because this is probably the most interesting case for in-car applications.
Average input SNR values [dB] from mic. 1/mic. 2 for typical background noise conditions in a car.
6.1. Estimating the Noise PSD
between the actual noise power spectrum and the estimate . From the definition, it is obvious that the LSD can be interpreted as the mean distance between two PSDs in dB. An extended analysis of different distance measures is presented in .
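One common way to compute a log spectral distance is the root mean square difference of the two PSDs on a dB scale. Since the definition used in this paper is not reproduced above, the exact form below is an assumption, as are the function name and the example values.

```python
import math

def log_spectral_distance_db(true_psd, estimated_psd):
    """Root-mean-square difference between two PSDs on a dB scale (assumed LSD form)."""
    squared_differences = [
        (10.0 * math.log10(p) - 10.0 * math.log10(q)) ** 2
        for p, q in zip(true_psd, estimated_psd)
    ]
    return math.sqrt(sum(squared_differences) / len(squared_differences))

true_psd = [1.0, 2.0, 4.0]
biased_estimate = [2.0 * p for p in true_psd]  # uniform overestimate by a factor of 2

lsd = log_spectral_distance_db(true_psd, biased_estimate)
print(f"LSD = {lsd:.2f} dB")
```

An estimate that is too large by a factor of two in every bin yields a distance of 10 log10(2), about 3 dB, which matches the interpretation of the LSD as a mean distance in dB.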
6.2. Spectral Combining
We observe that the output SNR curve closely follows the ideal curve but with a loss of 1–3 dB. This loss is essentially caused by the phase differences of the input signals. With the spectral combining approach only a magnitude combining is possible. Furthermore, the power spectral densities are estimates based on the noisy microphone signals, which leads to an additional loss in the SNR.
6.3. Combining SC and FLMS
Output SNR values [dB] for different combining techniques—short/tall speaker.
Evaluation of the MOS-Test.
In this paper, we have presented a diversity technique that combines the processed signals of several separate microphones. The aim of our approach was noise robustness for in-car hands-free applications, because single channel noise suppression methods are sensitive to the microphone location and in particular to the distance between speaker and microphone.
We have shown theoretically that the proposed signal weighting is equivalent to maximum-ratio-combining. Here we have assumed that the noise power spectral densities are equal for all microphone inputs. This assumption might be unrealistic. However, the simulation results for a two-microphone system demonstrate that a performance close to that of MRC can be achieved with real world noise situations.
Moreover, diversity combining is an effective means to reduce signal distortions due to reverberation and therefore improves the speech intelligibility compared to single channel noise reduction. This improvement can be explained by the fact that spectral combining equalizes frequency dips that occur only in one microphone input (cf. Figure 3).
The spectral combining requires an SNR estimate for each input signal. We have presented a simple noise PSD estimator that reliably approximates the noise power for stationary as well as nonstationary noise.
Research for this paper was supported by the German Federal Ministry of Education and Research (Grant no. 17 N11 08). Last but not least, the authors would like to thank the reviewers for their constructive comments and suggestions, which greatly improved the quality of this paper.
- Hänsler E, Schmidt G: Acoustic Echo and Noise Control: A Practical Approach. John Wiley & Sons, New York, NY, USA; 2004.
- Vary P, Martin R: Digital Speech Transmission: Enhancement, Coding and Error Concealment. John Wiley & Sons, New York, NY, USA; 2006.
- Hänsler E, Schmidt G: Speech and Audio Processing in Adverse Environments: Signals and Communication Technology. Springer, Berlin, Germany; 2008.
- Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing 1979, 27(2):113-120. 10.1109/TASSP.1979.1163209
- Martin R, Vary P: A symmetric two microphone speech enhancement system: theoretical limits and application in a car environment. Proceedings of the Digital Signal Processing Workshop, August 1992, Helsingoer, Denmark 451-452.
- Martin R, Vary P: Combined acoustic echo cancellation, dereverberation and noise reduction: a two microphone approach. Annales des Télécommunications 1994, 49(7-8):429-438.
- Azirani AA, Bouquin-Jeannès RL, Faucon G: Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator. IEEE Transactions on Speech and Audio Processing 1997, 5(5):484-487. 10.1109/89.622576
- Guérin A, Bouquin-Jeannès RL, Faucon G: A two-sensor noise reduction system: applications for hands-free car kit. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1125-1134. 10.1155/S1110865703305098
- Freudenberger J, Linhard K: A two-microphone diversity system and its application for hands-free car kits. Proceedings of European Conference on Speech Communication and Technology (INTERSPEECH '05), September 2005, Lisbon, Portugal 2329-2332.
- Gerkmann T, Martin R: Soft decision combining for dual channel noise reduction. Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH-ICSLP '06), September 2006, Pittsburgh, Pa, USA 5: 2134-2137.
- Freudenberger J, Stenzel S, Venditti B: Spectral combining for microphone diversity systems. Proceedings of European Signal Processing Conference (EUSIPCO '09), July 2009, Glasgow, UK 854-858.
- Flanagan JL, Lummis RC: Signal processing to reduce multipath distortion in small rooms. Journal of the Acoustical Society of America 1970, 47(6):1475-1481. 10.1121/1.1912067
- Allen JB, Berkley DA, Blauert J: Multimicrophone signal-processing technique to remove room reverberation from speech signals. Journal of the Acoustical Society of America 1977, 62(4):912-915. 10.1121/1.381621
- Gannot S, Moonen M: Subspace methods for multimicrophone speech dereverberation. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1074-1090. 10.1155/S1110865703305049
- Delcroix M, Hikichi T, Miyoshi M: Dereverberation and denoising using multichannel linear prediction. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(6):1791-1801.
- Ram I, Habets E, Avargel Y, Cohen I: Multi-microphone speech dereverberation using LIME and least squares filtering. Proceedings of European Signal Processing Conference (EUSIPCO '08), August 2008, Lausanne, Switzerland.
- Mukherjee K, Gwee B-H: A 32-point FFT based noise reduction algorithm for single channel speech signals. Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS '07), May 2007, New Orleans, La, USA 3928-3931.
- Armbrüster W, Czarnach R, Vary P: Adaptive noise cancellation with reference input. In Signal Processing III. Elsevier; 1986:391-394.
- Sklar B: Digital Communications: Fundamentals and Applications. Prentice Hall, Upper Saddle River, NJ, USA; 2001.
- Knapp C, Carter G: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing 1976, 24(4):320-327. 10.1109/TASSP.1976.1162830
- Freudenberger J, Stenzel S, Venditti B: An FLMS based two-microphone speech enhancement system for in-car applications. Proceedings of the 15th IEEE Workshop on Statistical Signal Processing (SSP '09), 2009 705-708.
- Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '79), April 1979, Washington, DC, USA 208-211.
- Martin R: Spectral subtraction based on minimum statistics. Proceedings of the European Signal Processing Conference (EUSIPCO '94), April 1994, Edinburgh, UK 1182-1185.
- Puder H: Single channel noise reduction using time-frequency dependent voice activity detection. Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC '99), September 1999, Pocono Manor, Pa, USA 68-71.
- Juneja A, Deshmukh O, Espy-Wilson C: A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), May 2002, Orlando, Fla, USA 4: 4160-4164.
- Ramírez J, Segura JC, Benítez C, de La Torre A, Rubio A: A new voice activity detector using subband order-statistics filters for robust speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), 2004 1: I849-I852.
- Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing 2001, 9(5):504-512. 10.1109/89.928915
- Cohen I: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Transactions on Speech and Audio Processing 2003, 11(5):466-475. 10.1109/TSA.2003.811544
- Sohn J, Sung W: A voice activity detector employing soft decision based noise spectrum adaptation. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 1998 1: 365-368.
- Forney GD Jr.: Exponential error bounds for erasure, list, and decision feedback schemes. IEEE Transactions on Information Theory 1968, 14(2):206-220. 10.1109/TIT.1968.1054129
- Sohn J, Kim NS, Sung W: A statistical model-based voice activity detection. IEEE Signal Processing Letters 1999, 6(1):1-3. 10.1109/97.736233
- ITU-T: Test signals for use in telephonometry, Recommendation ITU-T P.501. International Telecommunication Union, Geneva, Switzerland; 2007.
- Gray AH Jr., Markel JD: Distance measures for speech processing. IEEE Transactions on Acoustics, Speech and Signal Processing 1976, 24(5):380-391. 10.1109/TASSP.1976.1162849
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.