  • Research Article
  • Open Access

Microphone Diversity Combining for In-Car Applications

EURASIP Journal on Advances in Signal Processing20102010:509541

https://doi.org/10.1155/2010/509541

  • Received: 1 August 2009
  • Accepted: 17 March 2010
  • Published:

Abstract

This paper proposes a frequency domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to ensure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. This work proposes a two-stage approach. In the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed, similar to maximum ratio combining. The combined signal is then used as a reference for a frequency domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-based noise reduction systems, even if one microphone is heavily corrupted by noise.

Keywords

  • Speech Signal
  • Mean Opinion Score
  • Voice Activity Detection
  • Spectral Subtraction
  • Noise Power Spectral Density

1. Introduction

With in-car speech applications like hands-free car kits and speech recognition systems, speech is corrupted by engine noise and other noise sources like airflow from electric fans or car windows. For safety and comfort reasons, hands-free telephone systems should provide the same quality of speech as conventional fixed telephones. In practice, however, the speech quality of a hands-free car kit heavily depends on the particular position of the microphone. Speech has to be picked up as directly as possible to reduce reverberation and to provide a sufficient signal-to-noise ratio. Where to place the microphone inside the car is, however, a difficult question. The position is necessarily a compromise for different speaker sizes, because the distance between microphone and speaker depends significantly on the seat position and therefore on the size of the driver. Furthermore, noise sources like airflow from electric fans or car windows have to be considered. Placing two or more microphones in different positions enables a better compromise with respect to different speaker sizes and yields more noise robustness.

Today, noise reduction in hands-free car kits and in-car speech recognition systems is usually based on single-channel noise reduction or beamformer arrays [1–3]. Good noise robustness of single-microphone systems requires the use of single-channel noise suppression techniques, most of them derived from spectral subtraction [4]. Such noise reduction algorithms improve the signal-to-noise ratio, but they usually introduce undesired speech distortion. Microphone arrays can improve the performance compared to single-microphone systems. Nevertheless, the signal quality still depends on the speaker position. Moreover, the microphones are located in close proximity. Therefore, microphone arrays are often vulnerable to airflow that might disturb all microphone signals.

Alternatively, multimicrophone setups have been proposed that combine the processed signals of two or more separate microphones. The microphones are positioned separately (e.g., 40 to 80 cm apart) in order to ensure incoherent recording of noise [5–11]. Similar multichannel signal processing systems have been suggested to reduce signal distortion due to reverberation [12, 13]. Basically, all these approaches exploit the fact that the speech components in the microphone signals are strongly correlated while the noise components are only weakly correlated if the distance between the microphones is sufficiently large.

The question at hand with distributed arrays is how to combine microphone signals with possibly rather different signal conditions. In this paper, we consider a diversity technique that combines the processed signals of several separate microphones. The basic idea of our approach is to apply maximum-ratio-combining (MRC) to speech signals, where we propose a frequency domain diversity approach for two or more microphone signals. MRC maximizes the signal-to-noise ratio in the combined signal.

A major issue for the application of maximum-ratio-combining to multimicrophone setups is the estimation of the acoustic transfer functions. In telecommunications, the signal attenuation as well as the phase shift of each transmission path are usually measured to apply MRC. With speech applications we have no means to directly measure the acoustic transfer functions. There exist several blind approaches to estimate the acoustic transfer functions (see, e.g., [14–16]) which were successfully applied to dereverberation. However, the proposed estimation methods are computationally demanding.

In this paper, we show that maximum-ratio-combining can be achieved without explicit knowledge of the acoustic transfer functions. Proper signal weighting can be achieved based on an estimate of the input signal-to-noise ratio. We propose a two-stage processing of the microphone signals. In the first stage, the microphone signals are weighted with respect to their input signal-to-noise ratio. These weights guarantee maximum-ratio-combining of the signals with respect to the signal magnitudes. To ensure cophasal addition of the weighted signals, we use the combined signal as the reference signal for frequency domain LMS filters in the second stage. These filters adjust the phases of the microphone signals to guarantee coherent signal combining.

The proposed concept is similar to the single channel noise reduction system presented by Mukherjee and Gwee [17]. This system uses spectral subtraction to obtain a crude estimate of the speech signal. This estimate is then used as the reference signal of a single LMS filter. In this paper, we generalize this concept to multimicrophone systems, where our aim is not only noise reduction, but also dereverberation of the microphone signals.

The paper is organized as follows: In Section 2, we present some measurement results obtained in a car environment. These results motivate the proposed diversity approach. In Section 3, we present a signal combiner that achieves MRC weighting based on the knowledge of the input signal-to-noise ratios. Coherence-based signal combining is discussed in Section 4. In the subsequent section, we consider implementation issues. In particular, we present an estimator for the required input signal-to-noise ratios. Finally, in Section 6, we present some simulation results for different real world noise situations.

2. Measurement Results

The basic idea of our spectral combining approach is to apply MRC to speech signals. To motivate this approach, we first discuss some measurement results obtained in a car environment. For these measurements, we used two cardioid microphones with positions suited for car integration. One microphone (denoted by mic. 1) was installed close to the inside mirror. The second microphone (mic. 2) was mounted at the A-pillar.

Figure 1 depicts the SNR versus frequency for a driving situation at a car speed of 100 km/h. From this figure, we observe that the SNR values are quite distinct for these two microphone positions with differences of up to 10 dB depending on the particular frequency. We also note that the better microphone position is not obvious in this case, because the SNR curves cross several times.
Figure 1

Input SNR values for a driving situation at a car speed of 100 km/h.

Theoretically, MRC combining of the two input signals would result in an output SNR equal to the sum of the input SNR values (on a linear scale). With two inputs, MRC achieves a maximum gain of 3 dB for equal input SNR values. If the input SNR values are rather different, the sum is dominated by the maximum value. Hence, for the curves in Figure 1 the output SNR would essentially be the envelope of the two curves.
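This additive behaviour is easy to check numerically. The following small sketch (ours, not part of the paper) sums per-branch SNRs on a linear scale and converts back to dB:

```python
import numpy as np

def mrc_output_snr_db(snr_db_inputs):
    """Ideal MRC output SNR: the sum of the per-input SNRs on a
    linear scale, converted back to dB."""
    linear = 10.0 ** (np.asarray(snr_db_inputs) / 10.0)
    return 10.0 * np.log10(linear.sum())

# Two equal 5 dB branches gain exactly 3 dB.
print(mrc_output_snr_db([5.0, 5.0]))   # ~8.01 dB
# Unequal branches: the stronger input dominates the sum (envelope effect).
print(mrc_output_snr_db([0.0, 10.0]))  # ~10.41 dB
```

For the crossing SNR curves of Figure 1, evaluating this per frequency bin yields approximately the envelope of the two curves, as stated above.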

Next we consider the coherence of the noise and speech signals. The corresponding results are depicted in Figure 2. The upper plot presents measurements for two microphones installed close to the inside mirror in an end-fire beamformer constellation with a microphone distance of 7 cm. The lower plot contains the results for the microphone positions mic. 1 and mic. 2 (distance of 65 cm). From these results, we observe that the noise coherence closely follows the theoretical coherence function (dotted line in Figure 2) of an ideal diffuse sound field [18]. Separating the microphones significantly reduces the noise coherence for low frequencies. On the other hand, both microphone constellations have similar speech coherence. We note that the speech coherence is not ideal, as it has steep dips. The corresponding frequencies will probably be attenuated by a signal combiner that is solely based on coherence.
Figure 2

Coherence for noise and speech signals for two different microphone positions.

3. Spectral Combining

In this section, we present the basic system concept. To simplify the discussion, we assume that all signals are stationary and that the acoustic system is linear and time-invariant. In the subsequent section, we consider the modifications for nonstationary signals and time-variant systems.

We consider a scenario with $M$ microphones. The microphone signals can be modeled by the convolution of the speech signal $s(n)$ with the impulse response $h_i(n)$ of the acoustic system plus additive noise $n_i(n)$. Hence the microphone signals can be expressed as

$x_i(n) = h_i(n) * s(n) + n_i(n), \quad i = 1, \ldots, M$  (1)

where $*$ denotes convolution.

To apply the diversity technique, it is convenient to consider the signals in the frequency domain. Let $S(\omega)$ be the spectrum of the speech signal $s(n)$ and $X_i(\omega)$ be the spectrum of the $i$th microphone signal $x_i(n)$. The speech signal is linearly distorted by the acoustic transfer function $H_i(\omega)$ and corrupted by the noise term $N_i(\omega)$. Hence, the signal observed at the $i$th microphone has the spectrum

$X_i(\omega) = H_i(\omega)\,S(\omega) + N_i(\omega)$  (2)

In the following, we assume that the speech signal and the channel coefficients are uncorrelated. We assume a complex Gaussian distribution of the noise terms $N_i(\omega)$. Moreover, we presume that the noise power spectral density $\Phi_{nn}(\omega)$ is the same for all microphones. This assumption is reasonable for a diffuse sound field.

Our aim is to linearly combine the microphone signals so that the signal-to-noise ratio in the combined signal is maximized. In the frequency domain, the signal combining can be expressed as

$Y(\omega) = \sum_{i=1}^{M} W_i(\omega)\,X_i(\omega)$  (3)

where $W_i(\omega)$ is the weight of the $i$th microphone signal. With (2) we have

$Y(\omega) = \sum_{i=1}^{M} W_i(\omega) H_i(\omega) S(\omega) + \sum_{i=1}^{M} W_i(\omega) N_i(\omega)$  (4)

where the first sum represents the speech component and the second sum represents the noise component of the combined signal. Hence, the overall signal-to-noise ratio of the combined signal is

$\mathrm{SNR}(\omega) = \dfrac{E\left\{\left|\sum_{i} W_i(\omega) H_i(\omega) S(\omega)\right|^2\right\}}{E\left\{\left|\sum_{i} W_i(\omega) N_i(\omega)\right|^2\right\}}$  (5)

3.1. Maximum-Ratio-Combining

The optimal combining strategy that maximizes the signal-to-noise ratio in the combined signal is usually called maximal-ratio-combining (MRC) [19]. In this section, we briefly outline the derivation of the MRC weights for completeness. Furthermore, some of the properties of maximal ratio combining are discussed.

Let $\Phi_{ss}(\omega)$ be the speech power spectral density. Assuming that the noise power is the same for all microphones and that the noise at the different microphones is uncorrelated, we have

$\mathrm{SNR}(\omega) = \dfrac{\Phi_{ss}(\omega)\left|\sum_{i=1}^{M} W_i(\omega) H_i(\omega)\right|^2}{\Phi_{nn}(\omega)\sum_{i=1}^{M} |W_i(\omega)|^2}$  (6)
We now consider the term $\left|\sum_{i} W_i(\omega) H_i(\omega)\right|^2$ in the numerator of (6). Using the Cauchy-Schwarz inequality we have

$\left|\sum_{i=1}^{M} W_i(\omega) H_i(\omega)\right|^2 \le \sum_{i=1}^{M} |W_i(\omega)|^2 \, \sum_{i=1}^{M} |H_i(\omega)|^2$  (7)

with equality if $W_i(\omega) = c\,H_i^*(\omega)$, where $H_i^*(\omega)$ is the complex conjugate of the channel coefficient $H_i(\omega)$ and $c$ is a real-valued constant common to all weights $W_i(\omega)$. Thus, for the signal-to-noise ratio we obtain

$\mathrm{SNR}(\omega) \le \dfrac{\Phi_{ss}(\omega)}{\Phi_{nn}(\omega)} \sum_{i=1}^{M} |H_i(\omega)|^2$  (8)
With the weights $W_i(\omega) = c\,H_i^*(\omega)$, we obtain the maximum signal-to-noise ratio of the combined signal as the sum of the signal-to-noise ratios of the received signals

$\mathrm{SNR}(\omega) = \sum_{i=1}^{M} \gamma_i(\omega)$  (9)

where

$\gamma_i(\omega) = \dfrac{\Phi_{ss}(\omega)\,|H_i(\omega)|^2}{\Phi_{nn}(\omega)}$  (10)
is the input signal-to-noise ratio of the $i$th microphone. It is appropriate to choose $c$ as

$c = \dfrac{1}{\sum_{l=1}^{M} |H_l(\omega)|^2}$  (11)

This leads to the MRC weights

$W_i(\omega) = \dfrac{H_i^*(\omega)}{\sum_{l=1}^{M} |H_l(\omega)|^2}$  (12)

and the estimated (equalized) speech spectrum

$\hat{S} = \sum_{i=1}^{M} W_i\,X_i = S + \dfrac{\sum_{i=1}^{M} H_i^*\,N_i}{\sum_{l=1}^{M} |H_l|^2}$  (13)

where we have omitted the dependency on $\omega$. The estimated speech spectrum is therefore equal to the actual speech spectrum plus some weighted noise term.

The filter defined in (12) was previously applied to speech dereverberation by Gannot and Moonen in [14], because it ideally equalizes the microphone signals if a sufficiently accurate estimate of the acoustic transfer functions is available. The problem at hand with maximum-ratio-combining is that it is rather difficult and computationally complex to explicitly estimate the acoustic transfer characteristic for our microphone system.
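For a single frequency bin, the MRC weight rule and its two key properties (perfect equalization of the speech term and an output SNR equal to the sum of the input SNRs) can be verified numerically. The transfer factors below are arbitrary illustrative values, not measurements from the paper:

```python
import numpy as np

# One frequency bin, M = 2 channels, flat speech/noise PSDs and
# hypothetical complex transfer factors H_i (illustrative values only).
phi_ss, phi_nn = 1.0, 0.1
H = np.array([0.8 * np.exp(1j * 0.3), 0.4 * np.exp(-1j * 1.1)])

# MRC weights W_i = conj(H_i) / sum_l |H_l|^2: the equalizing choice.
W = np.conj(H) / np.sum(np.abs(H) ** 2)

# The speech term sum_i W_i H_i equals 1, i.e. perfect equalization.
speech_gain = np.sum(W * H)

# Output SNR equals the sum of the per-channel input SNRs.
snr_out = phi_ss * np.abs(speech_gain) ** 2 / (phi_nn * np.sum(np.abs(W) ** 2))
snr_sum = np.sum(phi_ss * np.abs(H) ** 2 / phi_nn)
print(snr_out, snr_sum)
```

The check only requires the channel coefficients; the next section shows how the same SNR can be reached without them.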

In the next section, we show that MRC combining can be achieved without explicit knowledge of the acoustic channels. The weights for the different microphones can be calculated based on an estimate of the signal-to-noise ratio for each microphone. The proposed filter achieves a signal-to-noise ratio according to (9), but does not guarantee perfect equalization.

3.2. Diversity Combining for Speech Signals

We consider the weights

$G_i(\omega) = \sqrt{\dfrac{\gamma_i(\omega)}{\sum_{l=1}^{M} \gamma_l(\omega)}}$  (14)

Assuming the noise power is the same for all microphones and substituting $\gamma_i(\omega)$ by (10) leads to

$G_i(\omega) = \dfrac{|H_i(\omega)|}{\sqrt{\sum_{l=1}^{M} |H_l(\omega)|^2}}$  (15)

Hence, we have

$G_i(\omega) = c_G(\omega)\,|W_i(\omega)|$  (16)

with

$c_G(\omega) = \sqrt{\sum_{l=1}^{M} |H_l(\omega)|^2}$  (17)

We observe that the weight $G_i(\omega)$ is proportional to the magnitude of the MRC weight $W_i(\omega)$, because the factor $c_G(\omega)$ is the same for all microphone signals. Consequently, coherent addition of the sensor signals weighted with the gain factors $G_i(\omega)$ still leads to a combining where the signal-to-noise ratio at the combiner output is the sum of the input SNR values. However, coherent addition requires an additional phase estimate. Let $\phi_i(\omega)$ denote the phase of $X_i(\omega)$ at frequency $\omega$. Assuming cophasal addition, the estimated speech spectrum is

$\hat{S}(\omega) = \sum_{i=1}^{M} G_i(\omega)\,X_i(\omega)\,e^{-j\phi_i(\omega)} = \sum_{i=1}^{M} G_i(\omega)\,|X_i(\omega)|$  (18)
Hence, in the case of stationary signals the term

$\tilde{H}(\omega) = \sum_{i=1}^{M} G_i(\omega)\,|H_i(\omega)| = \sqrt{\sum_{l=1}^{M} |H_l(\omega)|^2}$  (19)

can be interpreted as the resulting transfer characteristic of the system. An example is depicted in Figure 3. The upper plot presents the measured transfer characteristics for the two microphones in a car environment. Note that the microphones have a high-pass characteristic and attenuate signal components for frequencies below 1 kHz. The lower plot shows the curve $\tilde{H}(\omega)$. The spectral combiner equalizes most of the deep dips in the transfer functions from the mouth of the speaker to the microphones, while the envelope of the transfer functions is not equalized.
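A small numerical check illustrates the dip-filling property, assuming SNR-based weights of the form $G_i = \sqrt{\gamma_i / \sum_l \gamma_l}$ and hypothetical transfer magnitudes with dips at different frequencies:

```python
import numpy as np

# Hypothetical magnitude transfer functions over four frequency bins;
# each channel has a deep dip at a different bin (illustrative values).
H1 = np.array([1.0, 0.05, 1.0, 0.9])
H2 = np.array([0.9, 1.0, 0.04, 1.0])
phi_ss_over_nn = 10.0  # flat speech-to-noise PSD ratio

gamma1 = phi_ss_over_nn * H1 ** 2  # per-bin input SNRs
gamma2 = phi_ss_over_nn * H2 ** 2
gsum = gamma1 + gamma2

# SNR-based weights: only the input SNRs are needed, no channel phases.
G1 = np.sqrt(gamma1 / gsum)
G2 = np.sqrt(gamma2 / gsum)

# Resulting transfer characteristic G1|H1| + G2|H2| equals
# sqrt(|H1|^2 + |H2|^2): single-channel dips are filled in, while the
# common envelope of the transfer functions remains.
H_comb = G1 * H1 + G2 * H2
print(H_comb, np.sqrt(H1 ** 2 + H2 ** 2))
```

The bins where one channel has a deep dip keep a combined magnitude close to the other channel's, matching the behaviour described for Figure 3.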
Figure 3

Transfer characteristics to the microphones and of the combined signal.

3.3. Magnitude Combining

One challenge in multimicrophone systems with spatially separated microphones is a reliable phase estimation for the different input signals. For a coherent combining of the speech signals, we have to compensate the phase differences between the speech signals at the microphones. For this, it is sufficient to estimate the phase differences $\Delta\phi_i(\omega) = \phi_i(\omega) - \phi_1(\omega)$ to a reference microphone, for example, to the first microphone, for all $i = 2, \ldots, M$. Cophasal addition is then achieved by

$\hat{S}(\omega) = \sum_{i=1}^{M} G_i(\omega)\,X_i(\omega)\,e^{-j\Delta\phi_i(\omega)}$  (20)

But a reliable estimation of the phase differences is only possible in speech active periods, and furthermore only for those frequencies where speech is present. Estimating the phase differences as

$\Delta\phi_i(\omega) = \arg\left(X_i(\omega)\,X_1^*(\omega)\right)$  (21)

leads to unreliable phase values for time-frequency points without speech. In particular, if $X_1(\omega) = 0$ for some frequency $\omega$, the estimated phase is undefined. A combining using this estimate leads to additional signal distortions. Additionally, noise correlation would distort the phase estimation. A coarse estimate of the phase difference can also be obtained from the time-shift $\tau_i$ between the speech components in the microphone signals, for example, using the generalized correlation method [20]. The estimate is then $\Delta\phi_i(\omega) = \omega\,\tau_i$. Note that a combiner using these phase values would in a certain manner be equivalent to a delay-and-sum beamformer. However, for distributed microphone arrays in reverberant environments this phase compensation leads to a poor estimate of the actual phase differences.

Because of the drawbacks which come along with the phase estimation methods described above, we propose another scheme based on a two-stage combining approach. In the first stage, we use the spectral combining approach as described in Section 3.2 with a simple magnitude combining of the microphone signals. For the magnitude combining, the noisy phase of the first microphone signal is adopted for the other microphone signals. This is also apparent in Figure 5, where the phase of the noisy spectrum $X_1$ is taken for the spectra at the outputs of the filters $G_i$, before the signals are combined. This leads to the following incoherent combining of the input signals

$\tilde{S}(\omega) = e^{j\phi_1(\omega)} \sum_{i=1}^{M} G_i(\omega)\,|X_i(\omega)|$  (22)

The estimated speech spectrum is equal to

$e^{j\phi_1(\omega)} \sum_{i=1}^{M} G_i(\omega)\,|H_i(\omega)|\,|S(\omega)|$  (23)

plus some weighted noise terms. It follows from the triangle inequality that

$\sum_{i=1}^{M} G_i(\omega)\,|X_i(\omega)| \ge \left|\sum_{i=1}^{M} G_i(\omega)\,X_i(\omega)\right|$  (24)

Magnitude combining therefore does not guarantee maximum-ratio-combining. Yet the signal $\tilde{S}(\omega)$ is taken as the reference signal in the second stage, where the phase compensation is done. This coherence-based signal combining scheme is described in the following section.
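The triangle-inequality relation is easy to see at a single time-frequency point. The phasors below are illustrative values of our own choosing:

```python
import numpy as np

# One time-frequency point with component phasors that are out of
# phase across the two microphones (illustrative values).
S1 = 1.0 * np.exp(1j * 0.2)
S2 = 0.8 * np.exp(1j * 2.9)
w1, w2 = 0.7, 0.5

magnitude_combined = w1 * np.abs(S1) + w2 * np.abs(S2)  # cophasal sum
complex_combined = np.abs(w1 * S1 + w2 * S2)            # raw complex sum

# Triangle inequality: |w1 S1 + w2 S2| <= w1 |S1| + w2 |S2|
print(complex_combined, magnitude_combined)
```

Because the noise magnitudes add in the same way as the speech magnitudes, the magnitude-combined signal cannot reach the exact MRC output SNR, which is why a second, phase-compensating stage is needed.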

4. Coherence-Based Combining

As an example of a coherence based diversity system we first consider the two microphone approach by Martin and Vary [5, 6] as depicted in Figure 4. Martin and Vary applied the dereverberation principle of Allen et al. [13] to noise reduction. In particular, they proposed an LMS-based time domain algorithm to combine the different microphone signals. This approach provides effective noise suppression for frequencies where the noise components of the microphone signals are uncorrelated.
Figure 4

Basic system structure of the LMS approach.

Figure 5

Basic system structure of the diversity system with two inputs.

However, as we have seen in Section 2, for practical microphone distances in the range of 0.4 to 0.8 m the noise signals are correlated for low frequencies. These correlations reduce the noise suppression capabilities of the algorithm and lead to musical noise.

We will show in this section that a combination of the spectral combining with the coherence-based approach by Martin and Vary mitigates these issues.

4.1. Analysis of the LMS Approach

We now present an analysis of the scheme by Martin and Vary as depicted in Figure 4. The filter is adapted using the LMS algorithm. For stationary signals $s(n)$, $n_1(n)$, and $n_2(n)$, the adaptation converges to filter coefficients with the corresponding filter transfer function

$H(\omega) = \dfrac{\Phi_{x_2 x_1}(\omega)}{\Phi_{x_2 x_2}(\omega)}$  (25)

that minimizes the expected value

$E\left\{\left|X_1(\omega) - H(\omega)\,X_2(\omega)\right|^2\right\}$  (26)

where $\Phi_{x_2 x_1}(\omega)$ is the cross-power spectrum of the two microphone signals and $\Phi_{x_i x_i}(\omega)$ is the power spectrum of the $i$th microphone signal.

Assuming that the speech signal and the noise signals are uncorrelated, (25) can be written as

$H(\omega) = \dfrac{H_1(\omega) H_2^*(\omega)\,\Phi_{ss}(\omega) + \Phi_{n_2 n_1}(\omega)}{|H_2(\omega)|^2\,\Phi_{ss}(\omega) + \Phi_{nn}(\omega)}$  (27)

For frequencies where the noise components are uncorrelated, that is, $\Phi_{n_2 n_1}(\omega) = 0$, this formula reduces to

$H(\omega) = \dfrac{H_1(\omega) H_2^*(\omega)\,\Phi_{ss}(\omega)}{|H_2(\omega)|^2\,\Phi_{ss}(\omega) + \Phi_{nn}(\omega)}$  (28)

The filter according to (28) in fact results in a minimum mean squared error (MMSE) estimate of the speech component of $x_1(n)$ based on the signal $x_2(n)$. Hence, the weighted output is a combination of the MMSE estimates of the speech components of the two input signals. This explains the good noise reduction properties of the approach by Martin and Vary.

On the other hand, the coherence of the noise depends strongly on the distance between the microphones. For in-car applications, practical distances are in the range of 0.4 to 0.8 m. Therefore, only the noise components for frequencies above 1 kHz can be considered to be uncorrelated [6].

According to formula (27), the noise correlation leads to a bias

$\Delta H(\omega) = \dfrac{\Phi_{n_2 n_1}(\omega)}{|H_2(\omega)|^2\,\Phi_{ss}(\omega) + \Phi_{nn}(\omega)}$  (29)

of the filter transfer function. An approach that corrects the filter bias by estimating the noise cross-power density was presented in [21]. Another issue with speech enhancement solely based on the LMS approach is that the speech signals at the microphone inputs may be only weakly correlated for some frequencies, as shown in Section 2. Consequently, these frequency components will be attenuated in the output signals.
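The effect of correlated noise on the converged filter can be illustrated at a single bin, assuming the standard Wiener form $H = \Phi_{x_2 x_1} / \Phi_{x_2 x_2}$ with speech and noise uncorrelated (all numbers below are illustrative, not measured):

```python
import numpy as np

# Converged LMS solution at one frequency bin.
H1, H2 = 0.9 + 0.2j, 0.7 - 0.1j  # hypothetical channel coefficients
phi_ss, phi_nn = 1.0, 0.2

def lms_solution(phi_n2n1):
    num = H1 * np.conj(H2) * phi_ss + phi_n2n1  # cross-PSD of x2 and x1
    den = np.abs(H2) ** 2 * phi_ss + phi_nn     # PSD of x2
    return num / den

unbiased = lms_solution(0.0)  # uncorrelated noise: pure MMSE estimate
biased = lms_solution(0.15)   # correlated low-frequency noise
bias = biased - unbiased      # = phi_n2n1 / (|H2|^2 phi_ss + phi_nn)
print(bias)
```

The bias grows directly with the noise cross-power density, which is largest at low frequencies for closely spaced microphones.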

In the following, we discuss a modified LMS approach, where we first combine the microphone signals to obtain an improved reference signal for the adaptation of the LMS filters.

4.2. Combining MRC and LMS

To ensure suitable weighting and coherent signal addition we combine the diversity technique with the LMS approach to process the signals of the different microphones. It is informative to examine the combined approach under ideal conditions, that is, we assume ideal MRC weighting.

Analogously to (13), weighting with the MRC gain factors according to (12) results in the estimate

$\hat{S} = S + \dfrac{\sum_{i=1}^{M} H_i^*\,N_i}{\sum_{l=1}^{M} |H_l|^2}$  (30)
We now use the estimate $\hat{S}(\omega)$ as the reference signal for the LMS algorithm. That is, we adapt a filter $F_i(\omega)$ for each input signal such that the expected value

$E\left\{\left|\hat{S}(\omega) - F_i(\omega)\,X_i(\omega)\right|^2\right\}$  (31)

is minimized. The adaptation results in the filter transfer functions

$F_i(\omega) = \dfrac{\Phi_{x_i \hat{s}}(\omega)}{\Phi_{x_i x_i}(\omega)}$  (32)

Assuming that the speech signal and the noise signals are uncorrelated and substituting $\hat{S}(\omega)$ according to (30) leads to

$F_i(\omega) = \dfrac{H_i^*(\omega)\,\Phi_{ss}(\omega)}{|H_i(\omega)|^2\,\Phi_{ss}(\omega) + \Phi_{nn}(\omega)}$  (33)

$\quad + \dfrac{H_i^*(\omega)\,\Phi_{n_i n_i}(\omega)}{\left(|H_i(\omega)|^2\,\Phi_{ss}(\omega) + \Phi_{nn}(\omega)\right)\sum_{l} |H_l(\omega)|^2}$  (34)

$\quad + \dfrac{\sum_{l \ne i} H_l^*(\omega)\,\Phi_{n_i n_l}(\omega)}{\left(|H_i(\omega)|^2\,\Phi_{ss}(\omega) + \Phi_{nn}(\omega)\right)\sum_{l} |H_l(\omega)|^2}$  (35)
The first term

$\dfrac{H_i^*(\omega)\,\Phi_{ss}(\omega)}{|H_i(\omega)|^2\,\Phi_{ss}(\omega) + \Phi_{nn}(\omega)}$  (36)

in this sum is the Wiener filter that results in a minimum mean squared error estimate of the signal $S(\omega)$ based on the signal $X_i(\omega)$. The Wiener filter equalizes the microphone signal and minimizes the mean squared error between the filter output and the actual speech signal $S(\omega)$. Note that the phase of the term in (36) is $-\arg H_i(\omega)$, that is, the filter compensates the phase of the acoustic transfer function $H_i(\omega)$.

The other terms in the sum can be considered as filter biases where the term in (34) depends on the noise power density of the th input. The remaining terms depend on the noise cross power and vanish for uncorrelated noise signals. However, noise correlation might distort the phase estimation.

Similarly, when we consider the actual reference signal $\tilde{S}(\omega)$ according to (22), the filter equation for $F_i(\omega)$ contains the phase term

$e^{j\left(\phi_1(\omega) - \phi_i(\omega)\right)}$  (37)

with the sought phase difference $\phi_i(\omega) - \phi_1(\omega)$. If the correlation of the noise terms is sufficiently small, we obtain the estimated phase

$\arg F_i(\omega) = \phi_1(\omega) - \phi_i(\omega)$  (38)

The LMS algorithm implicitly estimates the phase differences between the reference signal $\tilde{S}(\omega)$ and the input signals $X_i(\omega)$. Hence, the spectra at the outputs of the filters $F_i(\omega)$ are in phase. This enables a cophasal addition of the signals according to (20).
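The implicit phase estimation can be sketched for a single frequency bin with a normalized LMS update of one complex coefficient. The step size, phase offset, and idealized noise-free reference are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single-bin NLMS sketch: adapt one complex coefficient F so that F*X
# tracks a reference spectrum Y. The phase offset between channel and
# reference (0.8 rad) is a hypothetical value.
mu = 0.5
true_phase = 0.8
F = 0.0 + 0.0j

for _ in range(200):
    S = rng.normal() + 1j * rng.normal()  # speech spectrum sample
    X = S * np.exp(-1j * true_phase)      # observed channel spectrum
    Y = S                                 # reference (idealized: clean)
    err = Y - F * X
    F += mu * np.conj(X) * err / (np.abs(X) ** 2 + 1e-8)

print(np.angle(F))  # converges to the phase offset, ~0.8
```

After convergence, multiplying the channel spectrum by F aligns its phase with the reference, so the filtered channels can be added cophasally.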

By estimating the noise power and noise cross-power densities we could correct the biases of the LMS filter transfer functions. Similarly, reducing the noisy signal components in (30) diminishes the filter biases. In the following, we will pursue the latter approach.

4.3. Noise Suppression

Maximum-ratio-combining provides an optimum weighting of the sensor signals. However, it does not necessarily suppress the noisy signal components. We therefore combine the spectral combining with an additional noise suppression filter. Of the numerous noise reduction techniques proposed in the literature, we consider only spectral subtraction [4], which supplements the spectral combining quite naturally. The basic idea of spectral subtraction is to subtract an estimate of the noise floor from an estimate of the spectrum of the noisy signal.

Estimating the overall SNR $\gamma(\omega) = \sum_{l} \gamma_l(\omega)$ according to (9), the spectral subtraction filter (see, e.g., [1, page 239]) for the combined signal can be written as

$H_{ss}(\omega) = \sqrt{\dfrac{\gamma(\omega)}{1 + \gamma(\omega)}}$  (39)

Multiplying this filter transfer function with (14) leads to the term

$G_i(\omega)\,H_{ss}(\omega) = \sqrt{\dfrac{\gamma_i(\omega)}{1 + \sum_{l} \gamma_l(\omega)}}$  (40)

This formula shows that noise suppression can be introduced by simply adding a constant to the denominator term in (14).

Most, if not all, implementations of spectral subtraction are based on an over-subtraction approach, where an overestimate of the noise power is subtracted from the power spectrum of the input signal (see, e.g., [22–25]). Over-subtraction can be included in (40) by using a constant $\alpha$ larger than one. This leads to the final gain factor

$G_i(\omega) = \sqrt{\dfrac{\gamma_i(\omega)}{\alpha + \sum_{l} \gamma_l(\omega)}}$  (41)

The parameter $\alpha$ hardly affects the gain factors for high signal-to-noise ratios, retaining the optimum weighting. For low signal-to-noise ratios, this term leads to an additional attenuation. The over-subtraction factor is usually a function of the SNR; sometimes it is also chosen differently for different frequency bands [25].
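The two limiting behaviours can be checked with a small sketch. The gain rule below is our reconstruction of the combined weighting-and-suppression factor and may differ in detail from the published one:

```python
import numpy as np

def sc_gain(gamma_i, gamma_sum, alpha=2.0):
    """Spectral-combining gain with over-subtraction, reconstructed as
    sqrt(gamma_i / (alpha + gamma_sum)). alpha = 2.0 is an assumed
    over-subtraction constant, not a value from the paper."""
    return np.sqrt(gamma_i / (alpha + gamma_sum))

# High input SNR: alpha is negligible, the MRC-style weighting survives.
print(sc_gain(100.0, 100.0))  # close to 1
# Low input SNR: alpha dominates the denominator -> extra attenuation.
print(sc_gain(0.1, 0.1))      # well below sqrt(gamma_i / gamma_sum) = 1
```

At high SNR the constant in the denominator vanishes relative to the SNR sum; at low SNR it acts as a noise floor that pulls the gain toward zero.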

5. Implementation Issues

Real world speech and noise signals are non-stationary processes. For an implementation of the spectral weighting, we have to consider short-time spectra of the microphone signals and estimate the short-time power spectral densities (PSD) of the speech signal and the noise components.

Therefore, the noisy signals are transformed into the frequency domain using a short-time Fourier transform of length $K$. Each block of $K$ consecutive samples is multiplied with a Hamming window. Subsequent blocks overlap by $K/2$ samples. Let $S(\mu,k)$, $X_i(\mu,k)$, and $N_i(\mu,k)$ denote the corresponding short-time spectra, where $\mu$ is the subsampled time index and $k$ is the frequency bin index.
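The framing described above can be sketched as follows; the concrete parameter values (K = 512, 256-sample overlap) are our assumption, chosen to be consistent with the overlap reported in the simulation section:

```python
import numpy as np

def stft_frames(x, K=512, overlap=256):
    """Short-time spectra X(mu, k): Hamming-windowed blocks of length K,
    hop K - overlap, transformed with a real FFT."""
    hop = K - overlap
    win = np.hamming(K)
    n_frames = 1 + (len(x) - K) // hop
    frames = np.stack([x[m * hop : m * hop + K] * win for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, K // 2 + 1)

# One second of noise at the 11025 Hz sampling rate used in Section 6.
x = np.random.default_rng(2).standard_normal(11025)
X = stft_frames(x)
print(X.shape)
```

Each row of the result is one short-time spectrum $X(\mu, \cdot)$; the PSD estimators of the next sections operate on $|X(\mu,k)|^2$.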

5.1. System Structure

The processing system for two inputs is depicted in Figure 5. The spectrum $\tilde{S}(\mu,k)$ results from incoherent magnitude combining of the input signals

$\tilde{S}(\mu,k) = e^{j\phi_1(\mu,k)} \sum_{i=1}^{M} G_i(\mu,k)\,|X_i(\mu,k)|$  (42)

where

$G_i(\mu,k) = \sqrt{\dfrac{\hat{\gamma}_i(\mu,k)}{\alpha + \sum_{l} \hat{\gamma}_l(\mu,k)}}$  (43)

The power spectral density of speech signals varies relatively fast over time. Therefore, the FLMS algorithm requires a quick update, that is, a large step size. If the step size is sufficiently large, the magnitudes of the FLMS filters follow the filters $G_i(\mu,k)$. Because the spectra at the outputs of the filters are in phase, we obtain the estimated speech spectrum as

$\hat{S}(\mu,k) = \sum_{i=1}^{M} F_i(\mu,k)\,X_i(\mu,k)$  (44)

To perform spectral combining we have to estimate the current signal-to-noise ratio based on the noisy microphone input signals. In the next sections, we propose a simple and efficient method to estimate the noise power spectral densities of the microphone inputs.

5.2. PSD Estimation

Commonly the noise PSD is estimated in speech pauses where the pauses are detected using voice activity detection (VAD, see e.g., [24, 26]). VAD-based methods provide good estimates for stationary noise. However, they may suffer from error propagation if subsequent decisions are not independent. Other methods, like the minimum statistics approach introduced by Martin [23, 27], use a continuous estimation that does not explicitly differentiate between speech pauses and speech active segments.

Our estimation method combines the VAD approach with the minimum statistics (MS) method. Minimum statistics is a robust technique to estimate the power spectral density of non-stationary noise by tracing the minimum of the recursively smoothed power spectral density within a time window of 1 to 2 seconds. We use these MS estimates and a simple threshold test to determine voice activity for each time-frequency point.

The proposed method prevents error propagation, because the MS approach is independent of the VAD. During speech pauses the noise PSD estimation can be enhanced compared with an estimate solely based on minimum statistics. A similar time-frequency dependent VAD was presented by Cohen to enhance the noise power spectral density estimation of minimum statistics [28].

For time-frequency points where the speech signal is inactive, the noise PSD can be approximated by recursive smoothing

$\hat{\Phi}_{nn,i}(\mu,k) = P_i(\mu,k)$  (45)

with

$P_i(\mu,k) = \rho\,P_i(\mu-1,k) + (1-\rho)\,|X_i(\mu,k)|^2$  (46)

where $\rho$ is the smoothing parameter.

During speech active periods the PSD can be estimated using the minimum statistics method introduced by Martin [23, 27]. With this approach, the noise PSD estimate is determined by the minimum value

$P_{\min,i}(\mu,k) = \min\left\{P_i(\mu,k), P_i(\mu-1,k), \ldots, P_i(\mu-D+1,k)\right\}$  (47)

within a sliding window of $D$ consecutive values of $P_i(\mu,k)$. The noise PSD is then estimated by

$\hat{\Phi}_{nn,i}^{(\mathrm{MS})}(\mu,k) = c_{\min}\,P_{\min,i}(\mu,k)$  (48)

where $c_{\min}$ is a parameter of the algorithm that compensates the bias of the minimum estimate and should be approximated as

$c_{\min} \approx \dfrac{\Phi_{nn}(\mu,k)}{E\left\{P_{\min,i}(\mu,k)\right\}}$  (49)

The MS approach provides a rough estimate of the noise power that strongly depends on the smoothing parameter $\rho$ and the size $D$ of the sliding window (for details cf. [27]). However, this estimate can be obtained regardless of whether speech is present or not.

The idea of our approach is to approximate the noise PSD by the MS estimate during speech active periods, while the smoothed input power is used for time-frequency points where speech is absent:

$\hat{\Phi}_{nn,i}(\mu,k) = \begin{cases} \hat{\Phi}_{nn,i}^{(\mathrm{MS})}(\mu,k), & I_i(\mu,k) = 1 \\ P_i(\mu,k), & I_i(\mu,k) = 0 \end{cases}$  (50)

where $I_i(\mu,k)$ is an indicator function for speech activity which will be discussed in more detail in the next section.

The current signal-to-noise ratio is then obtained by

$\hat{\gamma}_i(\mu,k) = \dfrac{|X_i(\mu,k)|^2 - \hat{\Phi}_{nn,i}(\mu,k)}{\hat{\Phi}_{nn,i}(\mu,k)}$  (51)

assuming that the noise and speech signals are uncorrelated.
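The switching estimator can be sketched for a single time-frequency point; variable names and parameter values are ours:

```python
def update_noise_psd(phi_prev, x_power, ms_estimate, speech_active, rho=0.9):
    """One recursive update of the noise PSD at a time-frequency point:
    smooth the input power when speech is absent, otherwise fall back
    on the minimum-statistics estimate (rho = 0.9 is an assumed
    smoothing parameter)."""
    if speech_active:
        return ms_estimate
    return rho * phi_prev + (1.0 - rho) * x_power

# Speech pause: the estimate converges toward the observed noise power.
phi = 0.0
for _ in range(100):
    phi = update_noise_psd(phi, x_power=2.0, ms_estimate=1.5, speech_active=False)
print(phi)  # close to 2.0
# Speech activity: the minimum-statistics value is used instead.
print(update_noise_psd(phi, 10.0, 1.5, True))  # 1.5
```

Because the minimum-statistics branch never sees the VAD output, a wrong activity decision cannot propagate into later estimates, which is the error-propagation argument made above.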

5.3. Voice Activity Detection

Human speech contains gaps not only in time but also in the frequency domain. It is therefore reasonable to estimate the voice activity in the time-frequency domain in order to obtain a more accurate VAD. The VAD function can then be calculated based on the current input power and the noise PSD obtained by minimum statistics.

Our aim is to determine for each time-frequency point whether the speech signal is active or inactive. We therefore consider the two hypotheses $\mathcal{H}_1(\mu,k)$ and $\mathcal{H}_0(\mu,k)$, which indicate speech presence or absence at the time-frequency point $(\mu,k)$, respectively. We assume that the coefficients $S(\mu,k)$ and $N(\mu,k)$ of the short-time spectra of both the speech and the noise signal are complex Gaussian random variables. In this case, the current input power, that is, the squared magnitude $|X(\mu,k)|^2$, is exponentially distributed with mean (power spectral density)

$\Phi_{xx}(\mu,k) = E\left\{|X(\mu,k)|^2\right\}$  (52)

Similarly we define

$\Phi_{ss}(\mu,k) = E\left\{|S(\mu,k)|^2\right\}, \qquad \Phi_{nn}(\mu,k) = E\left\{|N(\mu,k)|^2\right\}$  (53)

We assume that speech and noise are uncorrelated. Hence, we have

$\Phi_{xx}(\mu,k) = \Phi_{ss}(\mu,k) + \Phi_{nn}(\mu,k)$  (54)

during speech active periods and

$\Phi_{xx}(\mu,k) = \Phi_{nn}(\mu,k)$  (55)

in speech pauses.

In the following, we occasionally omit the dependency on $\mu$ and $k$ in order to keep the notation lucid. The conditional probability density functions of the random variable $z = |X|^2$ are [29]

$p(z \mid \mathcal{H}_0) = \dfrac{1}{\Phi_{nn}}\,\exp\left(-\dfrac{z}{\Phi_{nn}}\right)$  (56)

$p(z \mid \mathcal{H}_1) = \dfrac{1}{\Phi_{nn} + \Phi_{ss}}\,\exp\left(-\dfrac{z}{\Phi_{nn} + \Phi_{ss}}\right)$  (57)

Applying Bayes' rule for the conditional speech presence probability

$P(\mathcal{H}_1 \mid z) = \dfrac{(1-q)\,p(z \mid \mathcal{H}_1)}{(1-q)\,p(z \mid \mathcal{H}_1) + q\,p(z \mid \mathcal{H}_0)}$  (58)

we have [29]

$P(\mathcal{H}_1 \mid z) = \dfrac{\Lambda(z)}{\Lambda(z) + q/(1-q)}$  (59)

where $q$ is the a priori probability of speech absence and

$\Lambda(z) = \dfrac{p(z \mid \mathcal{H}_1)}{p(z \mid \mathcal{H}_0)}$  (60)

The decision rule for the $i$th channel is based on the conditional speech presence probability

$I_i(\mu,k) = \begin{cases} 1, & P(\mathcal{H}_1 \mid z) \ge p_0 \\ 0, & \text{otherwise} \end{cases}$  (61)

The parameter $p_0$ enables a tradeoff between the two possible error probabilities of voice activity detection. A value $p_0 > 0.5$ decreases the probability of a false alarm, that is, a detection when speech is absent. A value $p_0 < 0.5$ reduces the probability of a miss, that is, a missed detection in the presence of speech. Note that the generalized likelihood-ratio test

$\Lambda(z) \;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\; \eta, \qquad \eta = \dfrac{p_0}{1-p_0} \cdot \dfrac{q}{1-q}$  (62)

is according to the Neyman-Pearson lemma (see, e.g., [30]) an optimal decision rule. That is, for a fixed probability of a false alarm it minimizes the probability of a miss and vice versa. The generalized likelihood-ratio test was previously used by Sohn and Sung to detect speech activity in subbands [29, 31].

The test in inequality (62) is equivalent to

$P(\mathcal{H}_1 \mid z) \;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\; p_0$  (63)

where we have used (59). Solving for $z$ using (60), we obtain a simple threshold test for the $i$th microphone

$|X_i(\mu,k)|^2 \ge T\,\hat{\Phi}_{nn,i}(\mu,k)$  (64)

with the threshold

$T = \dfrac{1+\gamma_i}{\gamma_i}\,\ln\left(\eta\,(1+\gamma_i)\right)$  (65)

This threshold test is equivalent to the decision rule in (61). With this threshold test, speech is detected if the current input power is greater than or equal to the average noise power times the threshold $T$. This factor depends on the input signal-to-noise ratio $\gamma_i$ and the a priori probability of speech absence $q$.

In order to combine the activity estimates for the different input signals, we use the following rule

$I(\mu,k) = \max_{i} I_i(\mu,k)$  (66)

that is, speech is assumed to be active at a time-frequency point if it is detected in at least one channel.
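The resulting detector reduces to a simple per-bin comparison plus an across-channel combination. The sketch below is illustrative; T = 1.2 is the value used in the caption of Figure 6, and the OR-style combining is one plausible reading of the rule above:

```python
def vad_bin(x_power, noise_psd, T):
    """Per time-frequency point: declare speech active if the input
    power is at least T times the estimated noise power. In the text,
    the threshold T depends on the input SNR and the speech-absence
    prior q."""
    return x_power >= T * noise_psd

def vad_combined(decisions):
    """Across channels: assume speech activity if any single channel
    detects it."""
    return any(decisions)

print(vad_bin(10.0, 1.0, T=1.2))    # strong bin -> True
print(vad_combined([False, True]))  # -> True
```

Per-bin decisions avoid discarding frequency bands where only one spectral region carries speech energy.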

6. Simulation Results

In this section, we present some simulation results for different noise conditions typical in a car. For our simulations we consider the same microphone setup as described in Section 2, that is, we use a two-channel diversity system, because this is probably the most interesting case for in-car applications.

For the background noise situations, we recorded driving noise at 100 km/h and 140 km/h. As a third noise situation, we considered the noise which arises from an electric fan (defroster). With an artificial head, we recorded speech samples for two different seat positions. For both positions, we recorded two male and two female speech samples, each with a length of 8 seconds. The German speech samples were taken from recommendation P.501 of the International Telecommunication Union (ITU) [32]. Hence the evaluation was done using four different voices with two different speaker sizes, which leads to 8 different speaker configurations. For all recordings, we used a sampling rate of 11025 Hz. Table 1 contains the average SNR values for the considered noise conditions. The first value in each field refers to a short speaker while the second refers to a tall person. For all algorithms, we used an FFT length of $K$ with an overlap of 256 samples. For time windowing we apply a Hamming window.
Table 1
Average input SNR values [dB] from mic. 1/mic. 2 for typical background noise conditions in a car.

SNR IN          100 km/h    140 km/h    defrost
short speaker   1.2/3.1     −0.7/−0.5   1.7/1.3
tall speaker    1.9/10.8    −0.1/7.2    2.4/9.0

6.1. Estimating the Noise PSD

The spectrogram of one input signal and the result of the voice activity detection are shown in Figure 6 for the worst-case scenario (short speaker at a car speed of 140 km/h). It can be observed that time-frequency points with speech activity are reliably detected. Since the noise PSD is estimated with minimum statistics even during speech activity, false alarms in speech pauses hardly affect the noise PSD estimation.
Figure 6

Spectrogram of the microphone input (mic. 1 at car speed of 140 km/h, short speaker). The lower figure depicts the results of the voice activity detection (black representing estimated speech activity) with T = 1.2 and q = 0.5.

In Figure 7, we compare the estimated noise PSD with the actual PSD for the same scenario. The PSD is well approximated, with only minor deviations at high frequencies. To evaluate the noise PSD estimation for several driving situations, we calculated as an objective performance measure the log spectral distance (LSD)
(67)
between the actual noise power spectrum and its estimate. By definition, the LSD can be interpreted as the mean distance between two PSDs in dB. An extended analysis of different distance measures is presented in [33].
Figure 7

Estimated and actual noise PSD for mic. 2 at car speed of 140 km/h.
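One common definition of the LSD, consistent with its interpretation as a mean distance between two PSDs in dB, can be sketched as follows; the exact form in (67) may differ, and the function name is illustrative.

```python
import numpy as np

def log_spectral_distance(psd_true, psd_est):
    """Root-mean-square difference between two PSDs on a dB scale;
    returns 0 for identical spectra."""
    diff_db = 10.0 * np.log10(psd_true / psd_est)
    return np.sqrt(np.mean(diff_db ** 2))
```

For example, a uniform factor-of-ten mismatch between the two spectra yields an LSD of exactly 10 dB under this definition.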

The log spectral distances for the proposed noise PSD estimator are shown in Table 2. The first number in each field is the LSD achieved with the minimum statistics approach, while the second is the value for the proposed scheme. Note that every noise situation was evaluated with four different voices (two male and two female). From these results, we observe that the voice activity detection improves the PSD estimation in all considered driving situations.
Table 2
Log spectral distances with minimum statistics noise PSD estimation and with the proposed noise PSD estimator.

[dB]     100 km/h    140 km/h    defrost
mic. 1   3.93/3.33   2.47/2.07   3.07/1.27
mic. 2   4.6/4.5     3.03/2.33   3.4/1.5

6.2. Spectral Combining

Next, we consider the spectral combining as discussed in Section 3. Figure 8 presents the output SNR values for a driving situation with a car speed of 100 km/h. For this simulation we used ρ = 0, that is, spectral combining without noise suppression. In addition to the output SNR, the curve for ideal maximum-ratio-combining is depicted. This curve is simply the sum of the input SNR values for the two microphones, which we calculated from the actual noise and speech signals (cf. Figure 1).
Figure 8

Output SNR values for spectral combining without additional noise suppression (car speed of 100 km/h, ρ = 0).

We observe that the output SNR curve closely follows the ideal curve, but with a loss of 1–3 dB. This loss is essentially caused by the phase differences between the input signals: with the spectral combining approach, only magnitude combining is possible. Furthermore, the power spectral densities are estimated from the noisy microphone signals, which leads to an additional loss in SNR.
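The magnitude-only combining described above can be sketched as follows. The per-bin weights are proportional to the estimated SNR and normalized across channels; this particular weight rule is an assumption consistent with the MRC analogy of Section 3, and all names are illustrative.

```python
import numpy as np

def spectral_combining(spectra, snr_estimates):
    """SNR-weighted magnitude combining of several short-time
    spectra (channels along axis 0). Only magnitudes are combined,
    so the phase information of the channels is discarded, which
    accounts for the loss relative to ideal MRC."""
    snr = np.maximum(np.asarray(snr_estimates, dtype=float), 1e-12)
    weights = snr / snr.sum(axis=0)        # normalize over channels per bin
    mags = np.abs(np.asarray(spectra))
    return np.sum(weights * mags, axis=0)  # combined magnitude spectrum
```

With one channel at near-zero SNR, the output reduces to the magnitude of the reliable channel, which is the desired diversity behavior.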

6.3. Combining SC and FLMS

The output SNR of the combined approach without additional noise suppression is depicted in Figure 9. The theoretical SNR curve for ideal MRC is closely approximated by the output SNR of the combined system. This is a result of the implicit phase estimation of the FLMS approach, which leads to a coherent combining of the speech signals.
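A minimal per-bin frequency-domain LMS step illustrates how each input is filtered toward the spectral combining reference. The paper's FLMS operates block-wise with bias compensation [21], so this normalized single-tap-per-bin update is only a simplified sketch with illustrative names.

```python
import numpy as np

def flms_update(W, X, D, mu=0.1, eps=1e-8):
    """One NLMS step per frequency bin: Y = W * X is the filtered
    input, E = D - Y the error against the reference spectrum D
    (here the spectral combining output), and W is updated along
    the normalized stochastic gradient."""
    Y = W * X
    E = D - Y
    W = W + mu * np.conj(X) * E / (np.abs(X) ** 2 + eps)
    return W, Y, E
```

Because the update is complex-valued, the adapted filter aligns the phase of each input with the reference, which is the implicit phase estimation mentioned above.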
Figure 9

Output SNR values for the combined approach without additional noise suppression (car speed of 100 km/h, ρ = 0).

Now we consider the combined approach with additional noise suppression ( = 10). Figure 10 presents the corresponding results for a driving situation at a car speed of 100 km/h. The output SNR curve still follows the ideal MRC curve, but now with a gain of up to 5 dB.
Figure 10

Output SNR values for the combined approach with additional noise suppression (car speed of 100 km/h, ρ = 0).

In Table 3, we compare the output SNR values of the three considered noise conditions for different combining techniques. The first value is the output SNR for a short speaker, while the second represents the result for the tall speaker. The values marked FLMS correspond to the coherence-based FLMS approach with bias compensation as presented in [21] (see also Section 4.1). The label SC marks results based solely on spectral combining with additional noise suppression, as discussed in Sections 3 and 4.3. The results of the combined approach are labeled SC + FLMS. Finally, the values marked ideal FLMS are a benchmark obtained by using the clean, non-reverberant speech signal as reference for the FLMS algorithm.
Table 3
Output SNR values [dB] for different combining techniques (short/tall speaker).

SNR OUT      100 km/h    140 km/h    defrost
FLMS         8.8/13.3    4.4/9.0     7.8/12.3
SC           16.3/20.9   13.3/18.0   14.9/19.9
SC + FLMS    13.5/17.8   10.5/15.0   12.5/16.9
ideal FLMS   12.6/15.2   10.5/13.3   14.5/17.3

From the results in Table 3, we observe that spectral combining leads to a significant improvement of the output SNR compared to the coherence-based noise reduction; it even outperforms the "ideal" FLMS scheme. However, spectral combining introduces undesired speech distortions, similar to single channel noise reduction. This is also indicated by the results in Table 4, which presents distance values for the different combining systems. As an objective measure of speech distortion, we calculated the cosh spectral distance (a symmetrized version of the Itakura-Saito distance) between the power spectra of the clean input signal (without reverberation and noise) and the output speech signal (with filter coefficients obtained from noisy data).
Table 4
Cosh spectral distances for different combining techniques (short/tall speaker).

             100 km/h   140 km/h   defrost
FLMS         0.9/0.9    0.9/1.0    1.2/1.2
SC           1.3/1.4    1.4/1.5    1.5/1.7
SC + FLMS    1.2/1.1    1.2/1.2    1.4/1.5
ideal FLMS   0.9/0.8    1.1/1.0    1.5/1.4

The benefit of the combined system is also indicated by the results in Table 5, which presents Mean Opinion Score (MOS) values for the different algorithms. The MOS test was performed by 24 persons. The test set was presented in randomized order to avoid statistical dependencies on the test order. The FLMS approach using the spectral combining output as reference signal and the "ideal" FLMS reference approach are rated as the best noise reduction algorithms, with the scores of the combined approach similar to those of the "ideal" FLMS reference implementation. The evaluation also shows that the FLMS approach with spectral combining outperforms both the pure FLMS and the pure spectral combining algorithms in all tested acoustic situations.
Table 5
Evaluation of the MOS test.

MOS          100 km/h   140 km/h   defrost   average
FLMS         2.58       2.77       2.10      2.49
SC           3.19       3.15       2.96      3.10
SC + FLMS    3.75       3.73       3.88      3.78
ideal FLMS   3.81       3.67       3.94      3.81

The combined approach sounds more natural than pure spectral combining. The SNR and distance values are close to those of the "ideal" FLMS scheme, and the speech is free of musical tones. The absence of musical noise can also be seen in Figure 11, which shows the spectrograms of the enhanced speech and the input signals.
Figure 11

Spectrograms of the input and output signals with the SC + FLMS approach (car speed of 100 km/h, ρ = 0).

7. Conclusions

In this paper, we have presented a diversity technique that combines the processed signals of several separate microphones. The aim of our approach was noise robustness for in-car hands-free applications, because single channel noise suppression methods are sensitive to the microphone location and in particular to the distance between speaker and microphone.

We have shown theoretically that the proposed signal weighting is equivalent to maximum-ratio-combining. Here we have assumed that the noise power spectral densities are equal for all microphone inputs. This assumption might be unrealistic. However, the simulation results for a two-microphone system demonstrate that a performance close to that of MRC can be achieved with real world noise situations.

Moreover, diversity combining is an effective means to reduce signal distortions due to reverberation and therefore improves the speech intelligibility compared to single channel noise reduction. This improvement can be explained by the fact that spectral combining equalizes frequency dips that occur only in one microphone input (cf. Figure 3).

The spectral combining requires an SNR estimate for each input signal. We have presented a simple noise PSD estimator that reliably approximates the noise power for stationary as well as non-stationary noise.

Declarations

Acknowledgments

Research for this paper was supported by the German Federal Ministry of Education and Research (Grant no. 17 N11 08). Last but not least, the authors would like to thank the reviewers for their constructive comments and suggestions, which greatly improved the quality of this paper.

Authors’ Affiliations

(1)
Department of Computer Science, University of Applied Sciences Konstanz, Hochschule Konstanz, Brauneggerstr. 55, 78462 Konstanz, Germany

References

  1. Hänsler E, Schmidt G: Acoustic Echo and Noise Control: A Practical Approach. John Wiley & Sons, New York, NY, USA; 2004.
  2. Vary P, Martin R: Digital Speech Transmission: Enhancement, Coding and Error Concealment. John Wiley & Sons, New York, NY, USA; 2006.
  3. Hänsler E, Schmidt G: Speech and Audio Processing in Adverse Environments. Signals and Communication Technology, Springer, Berlin, Germany; 2008.
  4. Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing 1979, 27(2):113-120. doi:10.1109/TASSP.1979.1163209
  5. Martin R, Vary P: A symmetric two microphone speech enhancement system: theoretical limits and application in a car environment. Proceedings of the Digital Signal Processing Workshop, August 1992, Helsingoer, Denmark, 451-452.
  6. Martin R, Vary P: Combined acoustic echo cancellation, dereverberation and noise reduction: a two microphone approach. Annales des Télécommunications 1994, 49(7-8):429-438.
  7. Azirani AA, Bouquin-Jeannès RL, Faucon G: Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator. IEEE Transactions on Speech and Audio Processing 1997, 5(5):484-487. doi:10.1109/89.622576
  8. Guérin A, Bouquin-Jeannès RL, Faucon G: A two-sensor noise reduction system: applications for hands-free car kit. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1125-1134. doi:10.1155/S1110865703305098
  9. Freudenberger J, Linhard K: A two-microphone diversity system and its application for hands-free car kits. Proceedings of the European Conference on Speech Communication and Technology (INTERSPEECH '05), September 2005, Lisbon, Portugal, 2329-2332.
  10. Gerkmann T, Martin R: Soft decision combining for dual channel noise reduction. Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH—ICSLP '06), September 2006, Pittsburgh, Pa, USA, 5:2134-2137.
  11. Freudenberger J, Stenzel S, Venditti B: Spectral combining for microphone diversity systems. Proceedings of the European Signal Processing Conference (EUSIPCO '09), July 2009, Glasgow, UK, 854-858.
  12. Flanagan JL, Lummis RC: Signal processing to reduce multipath distortion in small rooms. Journal of the Acoustical Society of America 1970, 47(6):1475-1481. doi:10.1121/1.1912067
  13. Allen JB, Berkley DA, Blauert J: Multimicrophone signal-processing technique to remove room reverberation from speech signals. Journal of the Acoustical Society of America 1977, 62(4):912-915. doi:10.1121/1.381621
  14. Gannot S, Moonen M: Subspace methods for multimicrophone speech dereverberation. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1074-1090. doi:10.1155/S1110865703305049
  15. Delcroix M, Hikichi T, Miyoshi M: Dereverberation and denoising using multichannel linear prediction. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(6):1791-1801.
  16. Ram I, Habets E, Avargel Y, Cohen I: Multi-microphone speech dereverberation using LIME and least squares filtering. Proceedings of the European Signal Processing Conference (EUSIPCO '08), August 2008, Lausanne, Switzerland.
  17. Mukherjee K, Gwee B-H: A 32-point FFT based noise reduction algorithm for single channel speech signals. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), May 2007, New Orleans, La, USA, 3928-3931.
  18. Armbrüster W, Czarnach R, Vary P: Adaptive noise cancellation with reference input. In Signal Processing III. Elsevier; 1986:391-394.
  19. Sklar B: Digital Communications: Fundamentals and Applications. Prentice Hall, Upper Saddle River, NJ, USA; 2001.
  20. Knapp C, Carter G: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing 1976, 24(4):320-327. doi:10.1109/TASSP.1976.1162830
  21. Freudenberger J, Stenzel S, Venditti B: An FLMS based two-microphone speech enhancement system for in-car applications. Proceedings of the 15th IEEE Workshop on Statistical Signal Processing (SSP '09), 2009, 705-708.
  22. Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '79), April 1979, Washington, DC, USA, 208-211.
  23. Martin R: Spectral subtraction based on minimum statistics. Proceedings of the European Signal Processing Conference (EUSIPCO '94), April 1994, Edinburgh, UK, 1182-1185.
  24. Puder H: Single channel noise reduction using time-frequency dependent voice activity detection. Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '99), September 1999, Pocono Manor, Pa, USA, 68-71.
  25. Juneja A, Deshmukh O, Espy-Wilson C: A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), May 2002, Orlando, Fla, USA, 4:4160-4164.
  26. Ramírez J, Segura JC, Benítez C, de la Torre A, Rubio A: A new voice activity detector using subband order-statistics filters for robust speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), 2004, 1:I849-I852.
  27. Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing 2001, 9(5):504-512. doi:10.1109/89.928915
  28. Cohen I: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Transactions on Speech and Audio Processing 2003, 11(5):466-475. doi:10.1109/TSA.2003.811544
  29. Sohn J, Sung W: A voice activity detector employing soft decision based noise spectrum adaptation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 1998, 1:365-368.
  30. Forney GD Jr.: Exponential error bounds for erasure, list, and decision feedback schemes. IEEE Transactions on Information Theory 1968, 14(2):206-220. doi:10.1109/TIT.1968.1054129
  31. Sohn J, Kim NS, Sung W: A statistical model-based voice activity detection. IEEE Signal Processing Letters 1999, 6(1):1-3. doi:10.1109/97.736233
  32. ITU-T: Test signals for use in telephonometry. Recommendation ITU-T P.501, International Telecommunication Union, Geneva, Switzerland; 2007.
  33. Gray AH Jr., Markel JD: Distance measures for speech processing. IEEE Transactions on Acoustics, Speech and Signal Processing 1976, 24(5):380-391. doi:10.1109/TASSP.1976.1162849
